Article

A Deep-Learning Framework with Multi-Feature Fusion and Attention Mechanism for Classification of Chinese Traditional Instruments

1 School of Physics and Photoelectric Engineering, Key Laboratory of Gravitational Wave Precision Measurement of Zhejiang Province, Taiji Laboratory for Gravitational Wave Universe, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2805; https://doi.org/10.3390/electronics14142805
Submission received: 11 June 2025 / Revised: 7 July 2025 / Accepted: 9 July 2025 / Published: 12 July 2025

Abstract

Chinese traditional instruments are diverse and encompass a rich variety of timbres and rhythms, presenting considerable research potential. This work proposes a deep-learning framework for the automated classification of Chinese traditional instruments, addressing the challenges of acoustic diversity and cultural preservation. By integrating two datasets, CTIS and ChMusic, we constructed a combined dataset comprising four instrument families: wind, percussion, plucked string, and bowed string. Three time-frequency features, namely MFCC, CQT, and Chroma, were extracted to capture diverse sound information. A convolutional neural network architecture was designed, incorporating 3-channel spectrogram feature stacking and a hybrid channel–spatial attention mechanism to enhance the extraction of critical frequency bands and feature weights. Experimental results demonstrate that the feature-fusion method improves classification performance compared to single-feature inputs. Moreover, the attention mechanism further boosts test accuracy to 98.79%, outperforming the baseline model by 2.8% and achieving superior F1 scores and recall compared to classical architectures. An ablation study confirms the contribution of the attention mechanisms. This work validates the efficacy of deep learning in preserving intangible cultural heritage through precise analysis, offering a feasible methodology for the classification of Chinese traditional instruments.

1. Introduction

Music is a common art form in daily life. In recent years, the rapid advancements in artificial intelligence and information technologies have propelled the field of music information retrieval (MIR) into the spotlight, garnering widespread attention and research interest. Classification tasks, such as instrument and genre classification, represent a key research direction in MIR, with applications spanning music data management, automatic music transcription, and music recommendation systems. The primary objective of instrument classification is to automatically identify and categorize the sounds of different instruments by analyzing and processing audio data, thereby offering robust technical support for music research, education, and industrial applications [1].
Although listeners can usually distinguish the playing styles of different musical instruments with ease, certain instruments, especially those belonging to the same family, such as the trumpet and cornet among Western instruments, can sound highly similar in timbre. This similarity makes it challenging for ordinary listeners to tell them apart accurately. In addition, the intricate structural information inherent in music further complicates audio processing tasks. Instrument classification is one of the most popular research directions in MIR, and Western researchers have been working in this field for decades [2,3]. Early studies relied primarily on the physical characteristics of musical signals, such as timbre features [4,5] and pitch features [6]. As new methods and technologies emerged, researchers made further breakthroughs in classification tasks. Agostini et al. [7] achieved nearly 70% classification accuracy using spectral features with a Support Vector Machine (SVM), while Barbedo et al. [8] achieved around 70% classification accuracy on two public instrument datasets using pitch class. Although these methods achieved relatively good instrument classification results, they were limited by the complexity of manual feature extraction. Subsequently, researchers gradually shifted their focus from comparing machine learning methods to extracting audio features. Deng et al. [9] indicated that one of the most critical tasks in instrument classification is extracting the most suitable audio features, while Muller et al. [10] demonstrated that audio features commonly used in speech signal processing can also perform well in music processing. Ghisingh and Mittal [11] used different speech-signal-processing methods to extract features such as MFCC, SC, and ZCR for comparative experiments, showing that these features all work well for instrument classification and that different instruments favor different features. Prabavathy et al. [12] achieved good classification results for 16 types of instruments using SVM and KNN. Thus, machine learning methods employing various audio features gradually emerged as the primary approach in musical instrument classification. By utilizing suitable features and classifiers, these methods have achieved remarkable progress in classifying Western musical instruments.
In recent years, the rise of deep learning has brought new possibilities to classification tasks. In audio classification, deep-learning methods have shown significant advantages, and many researchers have achieved excellent classification accuracy on large audio datasets such as IRMAS [13,14] and GTZAN [15,16]. Deep-learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), can automatically extract multi-level features from raw audio data and learn complex time-frequency relations. Based on a CNN, Giri et al. combined and analyzed various types of audio on a Kaggle dataset, achieving excellent musical instrument classification results [17]. Additionally, the Transformer architecture, which has gained popularity in the image domain, has also been introduced into MIR [18,19], yielding promising results. However, some researchers have found that, for audio deep learning, feeding raw audio directly into a neural network can produce poorer results than training on manually extracted features [20]. Therefore, even in the deep-learning era, feature engineering remains a critical task.
However, despite the significant achievements of deep learning in the classification of Western musical instruments, its application to Chinese traditional instruments is still in its infancy. Chinese traditional music is an important part of Chinese culture and a significant component of the world's cultural heritage. Unlike Western instruments such as the saxophone and piano, Chinese traditional instruments are often made from natural materials, drawing on a rich variety of raw materials such as silk and bamboo, which makes them diverse in construction and varied in playing style [21]. Additionally, the timbre of Chinese traditional instruments is usually soft, delicate, subtle, and melodious, embodying a strong Eastern cultural charm. Instruments such as the Erhu, Pipa, and Guzheng are expected to express long, gentle, and weighted emotions during performance. In contrast, the timbre of Western instruments is typically bright, crisp, and resonant, exuding a modern atmosphere; instruments like the violin, piano, and saxophone convey clear, loud, and splendid emotions during performance [22]. This difference in emotional expression arises not only from the physical characteristics of the instruments themselves but is also closely related to the historical background and aesthetic concepts of Eastern and Western musical cultures. Western music emphasizes the precision of harmony and rhythm and highlights strong emotional expression, while Chinese traditional music focuses more on the fluidity of melody and the creation of artistic conception, pursuing a subtle and profound aesthetic. In recent years, some researchers have studied Chinese traditional instruments in the MIR field and achieved notable results. Cao [23] classified several Chinese traditional instruments using Deep Belief Networks (DBN), achieving an accuracy of 70.2%. Xu et al. [24] collected music data of Chinese folk instruments from the Internet and achieved a classification accuracy of 91% using attention-based Bidirectional Gated Recurrent Units (BI-GRU) with MFCC features. Li et al. [25] designed an 8-layer convolutional neural network on a dataset of Chinese folk instruments, achieving excellent instrument recognition and playing-technique recognition. Their research also pointed out that while deeper networks can achieve better recognition results on the test set, the accuracy of a 4-layer CNN was about 4% lower than that of an 8-layer CNN.
Based on the existing research, there is clearly still room for improvement in deep-learning applications within this domain, and several challenges remain in classifying Chinese traditional instruments. Bowed string instruments like the Erhu and Matouqin exhibit highly overlapping spectral features due to similar playing techniques and materials, resulting in highly similar spectral envelopes and thus high inter-class confusion. This requires the model to capture subtle spectral differences that a single traditional feature struggles to distinguish. Besides, the overtone distribution of Chinese instruments made of natural materials such as silk and bamboo is denser and more delicate, and slight differences in overtone band position and intensity strongly affect the perceived sound. To further exploit the musical characteristics and classification potential of Chinese traditional instruments, we need to continue advancing their classification and explore more efficient and accurate deep-learning methods. Notably, most existing studies focus on single-feature approaches, lacking a systematic exploration of multi-feature-fusion strategies, and the ways of fusing multiple features vary considerably. Therefore, this work not only uses three single features as input but also studies two fusion strategies, concat fusion and spectrogram stacking, and employs a neural network with an attention mechanism to improve classification performance. This study aims to promote the application of deep learning in the classification of Chinese traditional instruments. By systematically reviewing the development of musical instrument classification techniques, a deep-learning method for instrument classification is proposed. The main contributions of this paper are as follows:
(1)
The audio signals of Chinese traditional instruments contain multi-dimensional information, rendering their effective classification a challenging task. To address this challenge, we propose a neural network architecture specifically designed for the classification of Chinese traditional instruments.
(2)
We introduce channel attention and spatial attention mechanisms to enhance the channel information and key frequency bands of the input features, thereby addressing the limitation of convolutional neural networks, which extract all feature information uniformly through fixed convolution kernels and struggle to distinguish important features.
(3)
We integrate two datasets of Chinese traditional instruments and conduct comprehensive experiments with the proposed model. Using various features and their combinations as inputs and comparing against classical models, we validate the model’s generalization capability and robustness.
The other sections of the article are as follows: Section 2 introduces the dataset of Chinese traditional instruments used, the designed model, the features used, and the processing methods of the features; Section 3 describes the experimental setup and results; Section 4 organizes and analyzes the experimental results; Section 5 summarizes the entire research content and further looks forward to future work.

2. Materials and Methods

2.1. Dataset

In our research, we compiled and organized datasets of Chinese traditional instruments. CTIS is a dataset of Chinese traditional instruments released by the China Conservatory of Music [27]. The dataset comprises a total of 216 traditional instruments categorized into four major instrument families based on their playing techniques, namely Wind, Percussion, Plucked String, and Bowed String. Each instrument is recorded in a professional environment by performers and music teachers, covering a variety of performance styles. In addition, to enhance the diversity of data samples in this experiment, we supplemented it with the ChMusic dataset. ChMusic is a high-quality dataset of Chinese traditional instruments published by Shandong University of Science and Technology [26]. It contains performance tracks of 11 traditional instruments, also performed by professional musicians. After combining the two, the numbers of dataset files and instrument types we obtained are shown in Figure 1. In total, more than 5000 audio files covering 216 different traditional instruments were obtained, with all files labeled according to the audio content of the dataset.
Figure 2 shows the total audio duration for each instrument family in the dataset. Altogether, we obtained 125,049 s of audio, all mono at a 44.1 kHz sample rate. These audio data include not only performances of classic pieces for the corresponding traditional instruments but also segments demonstrating each instrument’s characteristic playing techniques. Each category of samples covers different playing speeds, dynamics, and techniques for that instrument, with each instrument including at least 5 traditional playing techniques, such as sweeping (Pipa) and plucking (Erhu), and covering performance scenarios such as solo and ensemble.

2.2. Dataset Preprocessing

The audio files in the combined dataset range in duration from 3 s up to about 500 s for a few files, as shown in Figure 3; the playing-technique files sound the instrument every few seconds, while the performance-piece files emit the instrument’s sound continuously from start to finish. For music data, too short a segment leads to information loss, while too long a segment leads to sparse data. Therefore, we set the segment duration to 5 s. The few audio files shorter than 5 s were padded, which does not affect the amount of information they contain.
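To make this preprocessing concrete, the sketch below shows one way to cut an audio file into 5 s mono clips with Librosa and to zero-pad short files or trailing remainders; the file name in the usage comment and the zero-padding strategy are illustrative assumptions rather than details stated in the paper.

```python
import librosa

SR = 44100          # sample rate of the combined dataset
CLIP_SECONDS = 5    # segment length chosen in this work
CLIP_SAMPLES = SR * CLIP_SECONDS

def split_into_clips(path):
    """Load one audio file and return a list of 5 s mono clips.

    Short files (and the trailing remainder of long files) are
    zero-padded to the full clip length.
    """
    y, _ = librosa.load(path, sr=SR, mono=True)
    clips = []
    for start in range(0, max(len(y), 1), CLIP_SAMPLES):
        clip = y[start:start + CLIP_SAMPLES]
        # pad the last (or an originally short) segment to exactly 5 s
        clip = librosa.util.fix_length(clip, size=CLIP_SAMPLES)
        clips.append(clip)
    return clips

# Example (hypothetical file name): clips = split_into_clips("erhu_solo_001.wav")
```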

2.3. Feature Extraction

In the field of MIR, the choice of audio features is crucial to the performance of classification tasks. Different audio features have their own strengths in different areas of audio processing. MFCC is derived from the Mel spectrum and represents cepstral coefficients in the frequency domain. It simulates the human auditory system and analyzes audio signals in a way that is closer to human perception. Although it originated in speech recognition, numerous researchers have verified that it also performs well in the music domain; for example, Rajesh and Nalini used MFCC for instrument emotion recognition with relatively high accuracy [28]. Therefore, we choose MFCC as one of the features to extract. CQT adopts a logarithmic frequency resolution, resulting in a frequency distribution consistent with the natural frequency distribution of musical scales. This makes CQT particularly suitable for analyzing the rich overtone structure of Chinese traditional instruments, and it has been widely applied in MIR in recent years; for instance, Yang et al. used CQT features for music emotion recognition [29]. The Chroma feature captures pitch information, and its frequency coverage includes the pitch range of Chinese traditional instruments, giving it good performance on musical tasks; for example, Shi et al. used Chroma features for music genre classification on the GTZAN dataset [30]. Based on these findings, we also adopt Chroma as one of our features.

2.3.1. MFCC

MFCC (Mel-Frequency Cepstral Coefficients) were introduced as early as the last century [31], representing an audio feature grounded in auditory research. As a fundamental technique in speech signal processing, MFCC has remained relevant across decades of research while continuing to enable key advances in related domains.
The MFCC computation pipeline is shown in Figure 4. First, the raw signal undergoes preprocessing, including pre-emphasis, framing, and windowing. The preprocessed signal is then converted to the frequency domain via the FFT (Fast Fourier Transform). Because low-frequency sounds travel farther along the basilar membrane of the inner ear than high-frequency sounds, bass tends to mask treble more easily, while it is comparatively difficult for treble to mask bass. In the low-frequency range, the critical bandwidth for masking is narrower, whereas it is wider in the high-frequency range. Therefore, a set of Mel filters spaced according to the critical bandwidth is typically used to filter the input signal; the energy output by each filter, after logarithmic compression and a DCT (Discrete Cosine Transform), can be used as input to the model. Figure 5 shows four MFCC spectrograms from the dataset.
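As an illustration of this pipeline, the sketch below extracts 13 MFCCs from a 5 s clip with Librosa, which performs the framing, windowing, FFT, Mel filtering, log compression, and DCT internally; the FFT size and hop length are assumptions chosen to give the 50% frame overlap reported in Section 3.1.

```python
import librosa

def extract_mfcc(clip, sr=44100, n_mfcc=13, n_fft=2048, hop_length=1024):
    """Return an (n_mfcc, n_frames) MFCC matrix for one audio clip.

    hop_length = n_fft // 2 corresponds to a 50% frame overlap.
    """
    return librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
```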

2.3.2. CQT

CQT (Constant Q Transform) is a time-frequency analysis method whose center frequencies are exponentially distributed. It can be viewed as a form of wavelet transform in which the ratio of the center frequency to the bandwidth of each filter in the bank is constant. Its calculation is given by Equation (1):
$$X[k] = \frac{1}{N[k]} \sum_{n=0}^{N[k]-1} x[n]\, w_k[n]\, e^{-j \frac{2\pi Q}{N[k]} n}$$
where $X[k]$ is the CQT coefficient of the k-th frequency band, $x[n]$ is the input audio signal, $w_k[n]$ is the window function corresponding to the k-th frequency band, $N[k]$ is the window length of the k-th band, which depends on the center frequency $f_k$, and $Q$ is the constant factor, defined by Equation (2):
$$Q = \frac{f_k}{\delta f_k}$$
where $\delta f_k$ is the frequency resolution. Unlike the FFT, the frequency axis of the CQT spectrum uses a logarithmic scale, and the filter window length is adjusted dynamically with frequency, achieving an optimized balance between time and frequency resolution [32]. In the CQT spectrogram, brightness represents the intensity or energy of frequency components: the higher the brightness, the stronger the signal energy at that frequency and time point; conversely, lower brightness indicates weaker energy. Figure 6 shows four CQT spectrograms from the dataset.
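For reference, a minimal CQT extraction with Librosa might look as follows; the 84 bins match the setting reported in Section 3.1, while the hop length and the conversion to decibels for visualization are assumptions.

```python
import librosa
import numpy as np

def extract_cqt(clip, sr=44100, n_bins=84, hop_length=1024):
    """Return the log-magnitude CQT spectrogram of one audio clip."""
    C = librosa.cqt(y=clip, sr=sr, n_bins=n_bins, hop_length=hop_length)
    # magnitude in dB; brighter values correspond to stronger energy
    return librosa.amplitude_to_db(np.abs(C), ref=np.max)
```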

2.3.3. Chroma

As one of the classic audio feature representations, Chroma serves to extract pitch content. Chroma features map the spectral energy of the original audio signal onto 12 fixed pitch classes, namely (C, C#, D, D#, E, F, F#, G, G#, A, A#, B). It not only represents well the pitch information in the melodies of the four instrument families in our dataset but also demonstrates a distinctive ability to represent non-linear tonal interactions. The processing begins by applying a Hanning window to the original audio signal, followed by the Short-Time Fourier Transform (STFT) to obtain the time-frequency representation $X(t, f)$, from which the power spectrum is computed. The STFT is defined as follows:
$$X(t, f) = \int_{-\infty}^{\infty} x(\tau)\, w(t - \tau)\, e^{-j 2\pi f \tau}\, d\tau$$
Then, the power-spectrum energy is projected onto 12 chroma bands through a triangular filter bank and normalized to obtain a chroma vector over time. Figure 7 shows four Chroma spectrograms from the dataset.
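A corresponding Chroma extraction sketch with Librosa is shown below; the STFT parameters are assumptions chosen to match the other features.

```python
import librosa

def extract_chroma(clip, sr=44100, n_chroma=12, n_fft=2048, hop_length=1024):
    """Return a (12, n_frames) chromagram: STFT energy mapped to 12 pitch classes."""
    return librosa.feature.chroma_stft(y=clip, sr=sr, n_chroma=n_chroma,
                                       n_fft=n_fft, hop_length=hop_length)
```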

2.3.4. Feature Fusion

Feature fusion refers to combining features from different sources or representations to improve model performance. In deep learning, the goal of feature fusion is to leverage the complementary information of different features to augment model expressivity and improve outcomes in classification, regression, or other downstream tasks. It is considered an effective way of enhancing the training of neural network models, as confirmed by the work of many researchers [33,34,35]. Considering the complementary and associative information among different features, we first align the extracted features so that they are combined on the same frequency scale. Next, we apply Min-Max normalization to these features to eliminate the dimensional differences between feature types. Finally, we concatenate the normalized features along the frequency axis to form the fused features, which serve as the input to the model.
Each of the three features captures a different aspect of the audio. The multi-resolution property of CQT matches well the wide pitch ranges of the instruments in the combined dataset, while MFCC accurately depicts the unique timbre of each traditional instrument; these texture features provide effective discriminative power for instrument classification. In addition, Chroma is particularly effective at capturing the melodic information of instrument frequencies. Fusing the three features increases the input dimensionality, but it enriches the spectral envelope, harmonic structure, and other frequency-domain information, enabling a more accurate description of the complex acoustic characteristics of Chinese traditional instruments. At the same time, the fusion method is simple, its computational cost is relatively low, and it is more robust.
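A minimal sketch of the concat fusion, assuming the three feature matrices already share (or are trimmed to) the same number of time frames: each feature is Min-Max normalized and then concatenated along the frequency axis.

```python
import numpy as np

def minmax(x, eps=1e-8):
    """Scale one feature matrix to [0, 1] to remove dimensional differences."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def concat_fusion(mfcc, cqt, chroma):
    """Concatenate normalized features along the frequency axis (axis 0).

    With 13 MFCCs, 84 CQT bins, and 12 chroma bins, the fused matrix has
    13 + 84 + 12 = 109 frequency rows and the common number of time frames.
    """
    n_frames = min(mfcc.shape[1], cqt.shape[1], chroma.shape[1])
    feats = [minmax(f[:, :n_frames]) for f in (mfcc, cqt, chroma)]
    return np.concatenate(feats, axis=0)
```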

2.3.5. Stacking Features

While concatenating the three features along the frequency axis is convenient, it disrupts the spatial structure of each spectrogram. To avoid losing the local time-frequency correlations of each feature in the concat method, we instead stack the three feature spectrograms: each audio file yields a 128 × 128 spectrogram per feature, and the three are stacked into a three-channel format analogous to an RGB image, i.e., with size 3 × 128 × 128, as shown in Figure 8. In this way, traditional audio sequence information is transformed into an image-like representation, turning audio deep learning into image deep learning. This not only integrates the information of multiple features but also makes the feature format better suited to convolutional neural networks.
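The stacking variant can be sketched as below, where each normalized feature map is resized to 128 × 128 and the three maps are stacked as channels; bilinear interpolation is an assumption, since the paper does not state the resizing method.

```python
import torch
import torch.nn.functional as F

def to_128(feature):
    """Min-Max normalize a 2D numpy feature map and resize it to 128 x 128."""
    f = (feature - feature.min()) / (feature.max() - feature.min() + 1e-8)
    t = torch.from_numpy(f).float()[None, None]               # (1, 1, H, W)
    t = F.interpolate(t, size=(128, 128), mode="bilinear", align_corners=False)
    return t[0, 0]

def stack_features(mfcc, cqt, chroma):
    """Stack MFCC, CQT, and Chroma maps into a 3 x 128 x 128 'image' tensor."""
    return torch.stack([to_128(f) for f in (mfcc, cqt, chroma)], dim=0)
```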

2.4. Model

With the continuous development and breakthroughs of artificial intelligence over the past few decades, convolutional neural networks are considered one of the best deep networks and have been validated in much image-processing research [36]. The pixel values in images exhibit strong local correlations, a characteristic that similarly applies to information contained in audio spectrograms. By stacking three types of features, audio information is transformed into a structure akin to images. This allows CNNs to directly apply their powerful convolution and pooling mechanisms without requiring additional data transformation steps, thus further enhancing computational efficiency compared to traditional machine learning models such as SVM [37].

2.4.1. Baseline Architecture

The classic VGGNet architecture from the CNN family was shown years ago to be a deep neural network with excellent performance [38]. Inspired by VGGNet, we designed a convolutional neural network model for this work.
As shown in Figure 9, our convolutional neural network consists of three convolutional layers. The first convolutional layer uses 3 × 3 kernels, capturing low-level feature patterns in the time-frequency domain through local receptive fields; a batch normalization layer then stabilizes the feature distribution. A ReLU activation between these layers adds non-linear expressiveness, while a 2 × 2 max-pooling layer down-samples the feature space, reducing the amount of data transferred between layers and thus accelerating computation while retaining important information. Similarly, the second convolutional layer uses 3 × 3 kernels and expands the channel dimension to 64. The third convolutional layer, also built with 3 × 3 kernels, further enlarges the receptive field and increases the number of channels to 128 to capture high-level time-frequency correlations of the instrument features. A global average pooling layer follows the convolutional layers, effectively compressing the feature dimensions, and a Dropout layer is introduced between the two fully connected layers to improve generalization. We selected LogSoftmax as the output activation because of its faster backpropagation [39] and reduced computational complexity; its formula is given in Equation (4). The final output is a probability distribution over the 4 instrument families.
$$\mathrm{LogSoftmax}(x_i) = \log\!\left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right)$$
where $x_i$ is the i-th element of the input vector.
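A PyTorch sketch of this baseline is given below; the 32-channel width of the first convolution, the 0.5 dropout rate, and the 256-unit hidden layer are assumptions, since the text only specifies the 64 and 128 channels of the later layers.

```python
import torch.nn as nn

class BaselineCNN(nn.Module):
    """Three conv blocks -> global average pooling -> two FC layers -> LogSoftmax."""

    def __init__(self, n_classes=4, in_channels=3):
        super().__init__()

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(
            block(in_channels, 32),   # low-level time-frequency patterns (assumed width)
            block(32, 64),            # channel dimension expanded to 64
            block(64, 128),           # larger receptive field, 128 channels
        )
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.classifier = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, n_classes),
            nn.LogSoftmax(dim=1),                 # Equation (4)
        )

    def forward(self, x):             # x: (batch, 3, 128, 128)
        x = self.gap(self.features(x)).flatten(1)
        return self.classifier(x)
```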

2.4.2. Attention Mechanism

To further enhance the model’s ability to interact with audio signals in a sequential manner and better process the stacked features, we introduced an attention mechanism module, which consists of two sub-modules: Channel attention and spatial attention. As shown in Figure 10, it presents the overall task framework, where ⊗ denotes element-wise multiplication. By embedding the attention sub-modules in the convolutional layers, the original feature maps are processed through channel attention and spatial attention. This not only strengthens the correlation of multi-channel spectra but also enables the model to focus on the spatial information of the input features.
Channel Attention: The main purpose of this module is to establish inter-connections among different channels to adaptively adjust the weight distribution. The original feature spectrogram undergoes dimension permutation to adapt to the subsequent processing by the fully connected layers. Two linear layers obtain high-order statistical characteristics through dimensionality reduction and activation reconstruction. Finally, channel weights are generated via the sigmoid function, and element-by-channel multiplication is performed with the original features to enhance the feature responses among different channels.
Spatial Attention: As can be seen from the figure, the spatial attention module models the importance of spatial positions in the feature spectrogram. First, the features enhanced by channel attention are divided into independent sub-spaces, and grouped convolutions extract local spatial features. A ReLU activation then adds non-linear expressiveness, after which standard convolutions aggregate the global spatial information. The aggregated information generates spatial weights through the sigmoid function. Finally, the channel-attended feature spectrogram is weighted along the spatial dimension, enhancing the regions where spectral energy is concentrated.
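The sketch below shows one way such channel and spatial attention sub-modules could be implemented in PyTorch; the reduction ratio, group count, and kernel sizes are assumptions, as these details are fixed by Figure 10 rather than the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze spatial dims, pass through two linear layers, and rescale channels."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # dimensionality reduction
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),   # activation reconstruction
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (B, C, H, W)
        w = self.mlp(x.mean(dim=(2, 3)))      # per-channel statistics -> weights
        return x * w[:, :, None, None]        # element-wise channel re-weighting

class SpatialAttention(nn.Module):
    """Grouped conv for local spatial cues, 1x1 conv to aggregate, sigmoid weights."""

    def __init__(self, channels, groups=4):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, groups=groups)
        self.relu = nn.ReLU(inplace=True)
        self.aggregate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                     # x: (B, C, H, W)
        w = torch.sigmoid(self.aggregate(self.relu(self.local(x))))
        return x * w                          # spatial re-weighting
```

For instance, `ChannelAttention(64)` followed by `SpatialAttention(64)` could be inserted after the second convolutional block of the baseline, matching the ⊗ re-weighting shown in Figure 10.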

3. Experiment and Result Analysis

3.1. Experimental Setup

In the experiment, the combined dataset was divided into a training set, a test set, and a validation set. For each instrument family, the data were partitioned in a ratio of 8:1:1. The training set contained approximately 100,039 s of audio, while the test set and validation set each contained approximately 12,505 s. We used Librosa [40] to extract the features required for this experiment, selecting appropriate parameters. For each audio frame, we kept 13 MFCC coefficients; this number balances representational ability against computational cost. Too many MFCC coefficients capture more detail but increase the computational burden and introduce noise, while too few save computation but may lose information and reduce training effectiveness. A sliding window with 50% overlap was used to extract audio features to ensure temporal continuity. The number of CQT bins was set to 84, which fully covers the main frequency range of the instrument signals in the dataset, and the number of chroma bins was kept at 12.
In this experiment, we used the Adam optimizer with an initial learning rate of 0.001 and a batch size of 32, set the number of epochs to 100, and adopted CrossEntropyLoss as the loss function. The model was trained on an Intel(R) Core(TM) i9-13900KF at 3.0 GHz, equipped with an Nvidia A5000 GPU and 64 GB of DDR5 RAM. The training environment was PyTorch on Ubuntu 24.04 Linux.
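A condensed training-loop sketch under the reported settings is given below; `train_loader` is a placeholder for a DataLoader over the stacked 3 × 128 × 128 tensors, `BaselineCNN` refers to the earlier sketch, and NLLLoss is paired with the model's LogSoftmax output, which is numerically equivalent to cross-entropy on raw logits.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BaselineCNN(n_classes=4).to(device)          # sketch from Section 2.4.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()     # with LogSoftmax outputs this equals cross-entropy

for epoch in range(100):                             # 100 epochs, batch size 32
    model.train()
    for features, labels in train_loader:            # placeholder DataLoader
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
```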

3.2. Performance Evaluation

Instrument classification is essentially a classification problem in deep learning, so it is appropriate to use Recall, Precision, and the F1 score as metrics for the overall performance of our work [41]; their values directly indicate the effectiveness of the experiments. The calculation method for Recall is given by Equation (5):
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The calculation method for Precision is given by Equation (6):
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
The calculation method for the F1 score is given by Equation (7):
$$F_1\ \mathrm{Score} = \frac{2\,TP}{2\,TP + FP + FN}$$
In the above equations, TP (True Positive) denotes samples of a class correctly predicted as that class, FP (False Positive) denotes samples of other classes incorrectly identified as the class in question, FN (False Negative) denotes samples of the class identified as other classes, and TN (True Negative) denotes samples of other classes correctly predicted as such. We also used a confusion matrix containing these elements as an indicator; it is a square matrix that clearly shows which categories the model identifies correctly and which category labels it predicts incorrectly [42]. This is very helpful for discovering instruments with similar musical characteristics. After the experiments, we use these metrics to evaluate the performance of the proposed methods.
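These metrics can be computed from the test-set predictions as in the short sketch below, using scikit-learn; the macro averaging over the four instrument families is an assumption, since the paper does not state the averaging mode.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred,
             class_names=("Wind", "Percussion", "Plucked String", "Bowed String")):
    """Return accuracy, recall, precision, F1 score, and the confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    cm = confusion_matrix(y_true, y_pred)
    return {"accuracy": acc, "recall": recall, "precision": precision,
            "f1": f1, "confusion_matrix": cm, "classes": class_names}
```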

3.3. Results and Analysis

In our research, we set up multiple groups of comparative experiments. Specifically, we compared the baseline with single features to the baseline with fused features. This was aimed at demonstrating, on one hand, the effectiveness of feature fusion in enhancing the performance of instrument classification. On the other hand, it was to prove that the stacking structure we employed could further improve the classification performance based on feature fusion. Subsequently, we replaced the baseline CNN with the CNN incorporating an attention mechanism, and conducted comparative experiments with several classic model architectures. To better evaluate the relative performance of these architectures, all the classic models used were trained with the same settings as our model. Finally, through ablation experiments, we demonstrated that the method we proposed could indeed improve the classification performance.

3.3.1. Comparison of Baseline Using Single Features or Fused Features

We used MFCC, CQT, and Chroma features, as well as their fused features, as inputs to the proposed baseline. The results are shown in Table 1. As expected, the features fused by the concat method (MFCC&CQT&Chroma) do contribute to improved classification performance: every metric improves compared to any single feature. In addition, when the stacked spectrograms of the three features are used as input, the accuracy, precision, F1 score, and recall all improve further compared to the concat method, indicating that the proposed stacking method indeed enables the convolutional neural network to exploit audio feature information more effectively.

3.3.2. Comparison of Classic Models Using Stacking Features

As is evident from Table 1, the best classification results are obtained when stacking features are used as input. Therefore, we used stacking features as input and compared the baseline CNN with an attention mechanism against different classic models. The results in Table 2 show that our method achieves superior performance: a convolutional neural network embedded with an attention mechanism can, to a certain extent, improve the classification of Chinese traditional instruments compared to a plain neural network.
The experimental results also show that, compared with classic neural networks, our model not only has the best classification performance but also keeps the training time within a reasonable range. The simple CNN has the shortest training time, yet its classification performance is the weakest. VGG16 also shows excellent classification performance, but its training time is much longer than that of the other architectures.

3.3.3. Ablation Study

We conducted ablation experiments on the proposed attention-based convolutional neural network. With all other parameters kept consistent, we tested the impact of each module on the overall accuracy. As shown in Table 3, we tested four cases: in the Component columns, S1 denotes the channel attention module and S2 the spatial attention module. A checkmark ✔ indicates that the module was used, while a cross × indicates that it was not.
Case 1, which used the baseline CNN alone, achieved the lowest classification accuracy. Using either the channel attention or the spatial attention module alone increased the accuracy, with the channel attention module giving the larger improvement. When both modules were combined, the highest classification accuracy was obtained. Compared to the baseline, the accuracy of Case 4 increased by 2.8%, with corresponding gains in recall, precision, and F1 score. These results indicate that the proposed method is effective. Figure 11 shows the confusion matrix. It can be seen that spatial attention and channel attention each individually reduced the instrument misclassification rate, while their combination achieved a further performance enhancement.

4. Discussion

In this study, two datasets of Chinese traditional instruments were collected and combined. A convolutional neural network was employed to explore the classification performance over four instrument families when using three features, namely MFCC, CQT, and Chroma, along with their fused features, as inputs. MFCC mimics the auditory characteristics of the human ear via the Mel filter bank, effectively capturing the global timbre of instruments. CQT uses a logarithmic frequency resolution, enhancing low-frequency resolution while preserving high-frequency detail, which is particularly valuable for analyzing the rich overtone series of Chinese traditional instruments. Chroma focuses on the energy distribution of the fundamental and its harmonics and is good at distinguishing instruments with differing pitch patterns. Fusing these three features through concatenation and stacking effectively improves the classification of traditional instruments compared to using a single feature as input. Furthermore, unlike the simple concat method, which flattens multi-dimensional features, the stacking method preserves the spatial relationships within every feature type. This arrangement allows the CNN’s convolutional layers to capture local dependencies more effectively, as each kernel operates within the inherent time-frequency structure of the acoustic features. This spatial preservation enhances the network’s ability to learn hierarchical patterns that are critical for instrument classification.
The ablation results show that the model without an attention mechanism has some difficulty distinguishing “Bowed String” from “Percussion” or “Plucked String”. First, although bowed and plucked instruments have different excitation modes (bowing versus plucking), both are string instruments, and their fundamental and harmonic structures are similar; especially in the mid- and high-frequency regions, their spectral characteristics may overlap, making them difficult for the model to separate. Second, percussion instruments form a very large family, including instruments with and without fixed pitch, and this diversity leads to significant differences in their sound characteristics; the spectra of some percussion instruments may overlap with those of string or wind instruments, further increasing the difficulty of classification. Given that an ordinary convolutional neural network does not discriminate strongly between important and unimportant features, we introduced an attention mechanism for weighted processing to highlight and focus on the important regions of the audio feature spectrogram. With stacked features as input, our attention mechanism significantly improves classification performance through the combined effect of channel and spatial attention. Channel attention retains cross-dimensional information through 3D permutation and uses multi-layer perceptrons to strengthen inter-channel dependencies, thereby adaptively adjusting channel weights and emphasizing critical frequency bands. Spatial attention, in turn, fuses spatial information through convolutional layers and removes max-pooling to retain more feature-map detail, focusing on the spatial importance of the feature spectrum and enhancing regions with concentrated spectral energy. The results show that this approach improved the classification of the four instrument families, indicating that the attention mechanism met the expected requirements in terms of audio feature representation and weight distribution across space and channels.

5. Conclusions

For a long time in the field of music information retrieval, research on instrument classification and identification has focused mainly on Western instruments, with only a small number of traditional ethnic instruments from other countries or regions included; research on Chinese traditional instruments has been comparatively scarce. We therefore conducted a study on the classification of Chinese traditional instruments. We combined two datasets, ChMusic and CTIS, to construct a dataset containing four families of traditional instruments. Through a series of experiments, we evaluated the effectiveness of three classic frequency-domain features, MFCC, CQT, and Chroma, as well as their fused features, in this classification task. The experimental results show that frequency-domain features provide rich spectral information, enabling the model to capture subtle differences in playing techniques and thus improving the accuracy of instrument classification.
By adopting fused features as input and a convolutional neural network based on the attention mechanism, on one hand, multi-dimensional audio information is effectively integrated, compensating for the insufficiency of single-feature information. On the other hand, the attention mechanism endows the convolutional neural network with the ability to focus on important channels and regions while suppressing redundant information.
We achieved strong classification results on the combined dataset. Through multiple groups of comparative experiments, we verified that a convolutional neural network with a properly configured attention mechanism can effectively classify Chinese traditional instruments, demonstrating superior performance to classic architectures such as ResNet18. This study also underlines the crucial role of feature selection: the choice of input feature representation has a significant impact on classification precision.
Although the three single features and their fusions achieved excellent classification results, future research could explore audio features and processing methods better tailored to the characteristics of Chinese traditional instruments to further improve performance, and the dataset could be extended with more Chinese traditional instruments. Meanwhile, as hardware advances, more complex model architectures also hold research potential. Our work also has practical application value, such as classifying low-quality recordings for the preservation of traditional music heritage or serving as an auxiliary teaching tool for traditional instruments. Next, we will further refine this research and investigate instrument classification from additional perspectives, such as lightweight designs or finer-grained classification of specific traditional instruments.

Author Contributions

Conceptualization, J.Y. and Y.W.; methodology, J.Y.; software, J.Y.; validation, T.Y. and H.Z.; investigation, T.Z. and T.Y.; resources, R.Z. and T.Y.; writing—original draft preparation, J.Y. and F.G.; writing—review and editing, Y.W.; supervision, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gorbunova, I.; Hiner, H. Music computer technologies and interactive systems of education in digital age school. In Proceedings of the International Conference Communicative Strategies of Information Society (CSIS), Saint-Petersburg, Russia, 26–27 October 2018; Atlantis Press: Paris, France, 2019; pp. 124–128.
  2. Marques, J.; Moreno, P.J. A study of musical instrument classification using gaussian mixture models and support vector machines. Camb. Res. Lab. Tech. Rep. Ser. CRL 1999, 4, 143.
  3. Kostek, B.; Czyzewski, A. Representing musical instrument sounds for their automatic classification. J.-Audio Eng. Soc. 2001, 49, 768–785.
  4. Loureiro, M.A.; de Paula, H.B.; Yehia, H.C. Timbre Classification of A Single Musical Instrument. In Proceedings of the ISMIR, Barcelona, Spain, 10–14 October 2004.
  5. Herrera-Boyer, P.; Peeters, G.; Dubnov, S. Automatic classification of musical instrument sounds. J. New Music Res. 2003, 32, 3–21.
  6. Herrera-Boyer, P.; Klapuri, A.; Davy, M. Automatic classification of pitched musical instrument sounds. In Signal Processing Methods for Music Transcription; Springer: Boston, MA, USA, 2006; pp. 163–200.
  7. Agostini, G.; Longari, M.; Pollastri, E. Musical instrument timbres classification with spectral features. EURASIP J. Adv. Signal Process. 2003, 2003, 943279.
  8. Barbedo, J.G.A.; Tzanetakis, G. Musical instrument classification using individual partials. IEEE Trans. Audio Speech Lang. Process. 2010, 19, 111–122.
  9. Deng, J.D.; Simmermacher, C.; Cranefield, S. A study on feature analysis for musical instrument classification. IEEE Trans. Syst. Man, Cybern. Part B (Cybern.) 2008, 38, 429–438.
  10. Muller, M.; Ellis, D.P.; Klapuri, A.; Richard, G. Signal processing for music analysis. IEEE J. Sel. Top. Signal Process. 2011, 5, 1088–1110.
  11. Ghisingh, S.; Mittal, V.K. Classifying musical instruments using speech signal processing methods. In Proceedings of the 2016 IEEE Annual India Conference (INDICON), Bangalore, India, 16–18 December 2016; pp. 1–6.
  12. Prabavathy, S.; Rathikarani, V.; Dhanalakshmi, P. Classification of Musical Instruments using SVM and KNN. Int. J. Innov. Technol. Explor. Eng. 2020, 9, 1186–1190.
  13. Guo, R. Research on Neural Network-based Automatic Music Multi-Instrument Classification Approach. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 792–798.
  14. Avramidis, K.; Kratimenos, A.; Garoufis, C.; Zlatintsi, A.; Maragos, P. Deep convolutional and recurrent networks for polyphonic instrument classification from monophonic raw audio waveforms. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3010–3014.
  15. Senac, C.; Pellegrini, T.; Mouret, F.; Pinquier, J. Music feature maps with convolutional neural networks for music genre classification. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, Florence, Italy, 19–21 June 2017; pp. 1–5.
  16. Deng, X. Music Genre Classification and Recognition Using Improved Deep Convolutional Neural Network-DenseNet-II. In Proceedings of the 2024 Second International Conference on Data Science and Information System (ICDSIS), Hassan, India, 17–18 May 2024; pp. 1–4.
  17. Giri, G.A.V.M.; Radhitya, M.L. Musical instrument classification using audio features and convolutional neural network. J. Appl. Inform. Comput. 2024, 8, 226–234.
  18. Reghunath, L.C.; Rajan, R. Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music. EURASIP J. Audio Speech Music Process. 2022, 2022, 11.
  19. Zhuang, Y.; Chen, Y.; Zheng, J. Music genre classification with transformer classifier. In Proceedings of the 2020 4th International Conference on Digital Signal Processing, Chengdu, China, 19–21 June 2020; pp. 155–159.
  20. Variani, E.; Sainath, T.N.; Shafran, I.; Bacchiani, M. Complex linear projection (CLP): A discriminative approach to joint feature extraction and acoustic modeling. In Proceedings of the INTERSPEECH, San Francisco, CA, USA, 8–12 September 2016; pp. 808–812.
  21. Stock, J.P. Reviewed Work: Chinese Musical Instruments: An Introduction Yang Mu. Br. J. Ethnomusicol. 1993, 2, 153–156.
  22. Shao, B. Differences between Chinese and Western cultures from the perspective of musical instrument timbre. J. Zibo Univ. Social Sci. Ed. 1999, 4, 80–83.
  23. Cao, P. Identification and classification of Chinese traditional musical instruments based on deep learning algorithm. In Proceedings of the 2nd International Conference on Computing and Data Science, Stanford, CA, USA, 28–30 January 2021; pp. 1–5.
  24. Xu, K. Recognition and classification model of music genres and Chinese traditional musical instruments based on deep neural networks. Sci. Program. 2021, 2021, 2348494.
  25. Li, R.; Zhang, Q. Audio recognition of Chinese traditional instruments based on machine learning. Cogn. Comput. Syst. 2022, 4, 108–115.
  26. Gong, X.; Zhu, Y.; Zhu, H.; Wei, H. Chmusic: A traditional Chinese music dataset for evaluation of instrument recognition. In Proceedings of the 4th International Conference on Big Data Technologies, Zibo, China, 24–26 September 2021; pp. 184–189.
  27. Liang, X.; Li, Z.; Liu, J.; Li, W.; Zhu, J.; Han, B. Constructing a multimedia Chinese musical instrument database. In Proceedings of the 6th Conference on Sound and Music Technology (CSMT) Revised Selected Papers, Xiamen, China, 24–26 November 2018; Springer: Singapore, 2019; pp. 53–60.
  28. Rajesh, S.; Nalini, N. Musical instrument emotion recognition using deep recurrent neural network. Procedia Comput. Sci. 2020, 167, 16–25.
  29. Yang, P.T.; Kuang, S.M.; Wu, C.C.; Hsu, J.L. Predicting music emotion by using convolutional neural network. In Proceedings of the International Conference on Human-Computer Interaction, Copenhagen, Denmark, 19–24 July 2020; Springer: Cham, Switzerland, 2020; pp. 266–275.
  30. Shi, L.; Li, C.; Tian, L. Music genre classification based on chroma features and deep learning. In Proceedings of the 2019 Tenth International Conference on Intelligent Control and Information Processing (ICICIP), Marrakesh, Morocco, 14–19 December 2019; pp. 81–86.
  31. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366.
  32. Sun, J.; Li, H.; Lei, L. Key detection through pitch class distribution model and ANN. In Proceedings of the 2009 16th International Conference on Digital Signal Processing, Santorini, Greece, 5–7 July 2009; pp. 1–6.
  33. Sharma, D.; Taran, S.; Pandey, A. A fusion way of feature extraction for automatic categorization of music genres. Multimed. Tools Appl. 2023, 82, 25015–25038.
  34. Wu, M. Music Emotion Classification Model Based on Multi Feature Image Fusion. In Proceedings of the 2024 First International Conference on Software, Systems and Information Technology (SSITCON), Tumkur, India, 18–19 October 2024; pp. 1–6.
  35. Chang, P.C.; Chen, Y.S.; Lee, C.H. IIOF: Intra-and Inter-feature orthogonal fusion of local and global features for music emotion recognition. Pattern Recognit. 2024, 148, 110200.
  36. Chauhan, R.; Ghanshala, K.K.; Joshi, R. Convolutional Neural Network (CNN) for Image Detection and Recognition. In Proceedings of the 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), Jalandhar, India, 15–17 December 2018; pp. 278–282.
  37. Costa, Y.M.; Oliveira, L.S.; Silla, C.N., Jr. An evaluation of convolutional neural networks for music classification using spectrograms. Appl. Soft Comput. 2017, 52, 28–38.
  38. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  39. De Brebisson, A.; Vincent, P. An exploration of softmax alternatives belonging to the spherical loss family. arXiv 2015, arXiv:1511.05042.
  40. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in python. SciPy 2015, 2015, 18–24.
  41. Goutte, C.; Gaussier, E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In Proceedings of the European Conference on Information Retrieval, Santiago de Compostela, Spain, 21–23 March 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 345–359.
  42. Krstinić, D.; Braović, M.; Šerić, L.; Božić-Štulić, D. Multi-label classifier performance evaluation with confusion matrix. Comput. Sci. Inf. Technol. 2020, 1, 1–14.
Figure 1. Number of audio files and types for each instrument family.
Figure 2. Total Duration of musical instruments.
Figure 3. The number of files with different durations.
Figure 4. MFCC extraction process.
Figure 5. MFCC spectrograms of some audio in the dataset.
Figure 6. CQT spectrograms of some audio in the dataset.
Figure 7. Chroma spectrograms of some audio in the dataset.
Figure 8. 3-channel stacking spectrogram.
Figure 9. Proposed baseline CNN.
Figure 10. Proposed CNN with attention mechanism.
Figure 11. The classification accuracy of the test set in each case.
Table 1. Comparison of different features as input.

Input              Accuracy   Recall    Precision   F1 Score
MFCC               0.9258     0.9256    0.9251      0.9286
CQT                0.9198     0.9198    0.9273      0.9193
Chroma             0.7996     0.7894    0.7974      0.7922
MFCC&CQT&Chroma    0.9417     0.9397    0.9392      0.9366
Stacking           0.9591     0.9562    0.9615      0.9590
Table 2. Comparison of classic models using stacking features.

Model       Accuracy   Recall    Precision   F1 Score   Time (s)
Ours        0.9879     0.9858    0.9884      0.9859     2566.54
ResNet18    0.9278     0.9378    0.9266      0.9207     2396.88
ResNet50    0.9740     0.9749    0.9748      0.9733     3177.17
VGG16       0.9820     0.9815    0.9813      0.9813     4379.65
CNN         0.9399     0.9352    0.9274      0.9306     2025.73
DenseNet    0.9771     0.9757    0.9768      0.9782     3365.25
ViT         0.9798     0.9779    0.9783      0.9775     4534.31
Table 3. Experimental results under different cases.

Case   S1   S2   Accuracy   Recall    Precision   F1 Score
1      ×    ×    0.9591     0.9562    0.9615      0.9590
2      ×    ✔    0.9698     0.9708    0.9674      0.9699
3      ✔    ×    0.9737     0.9719    0.9783      0.9759
4      ✔    ✔    0.9879     0.9858    0.9884      0.9859
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
