A Deep-Learning Framework with Multi-Feature Fusion and Attention Mechanism for Classification of Chinese Traditional Instruments
Abstract
1. Introduction
- (1) The audio signals of Chinese traditional instruments carry multi-dimensional information, which makes their effective classification a challenging task. To address this challenge, we propose a neural network architecture designed specifically for the classification of Chinese traditional instruments.
- (2) We introduce channel-attention and spatial-attention mechanisms that emphasize informative channels and key frequency bands of the input features, addressing a limitation of convolutional neural networks: fixed convolution kernels extract all feature information uniformly and struggle to distinguish the important features.
- (3) We merge two datasets of Chinese traditional instruments and conduct comprehensive experiments with the proposed model. Using various features and their combinations as inputs, and comparing against classic models, we validate the model's generalization capability and robustness.
2. Materials and Methods
2.1. Dataset
2.2. Dataset Preprocessing
2.3. Feature Extraction
2.3.1. MFCC
2.3.2. CQT
2.3.3. Chroma
2.3.4. Feature Fusion
2.3.5. Stacking Features
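The fused input of Sections 2.3.4 and 2.3.5 stacks the MFCC, CQT, and chroma maps along a channel axis. A minimal NumPy sketch of such stacking, assuming each feature matrix is zero-padded or cropped to a common shape; in practice the matrices would come from librosa (`librosa.feature.mfcc`, `librosa.cqt`, `librosa.feature.chroma_stft`), and the target shape used here is illustrative, not the paper's exact configuration.

```python
import numpy as np

def pad_or_crop(feat, n_bins, n_frames):
    """Zero-pad or crop a 2-D feature matrix to (n_bins, n_frames)."""
    out = np.zeros((n_bins, n_frames), dtype=feat.dtype)
    b = min(feat.shape[0], n_bins)
    t = min(feat.shape[1], n_frames)
    out[:b, :t] = feat[:b, :t]
    return out

def stack_features(features, n_bins=84, n_frames=128):
    """Stack heterogeneous time-frequency features into a (C, H, W) tensor."""
    return np.stack([pad_or_crop(f, n_bins, n_frames) for f in features])

# Toy matrices standing in for MFCC (40 bins), CQT (84 bins), chroma (12 bins).
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((40, 120))
cqt = rng.standard_normal((84, 130))
chroma = rng.standard_normal((12, 128))
x = stack_features([mfcc, cqt, chroma])  # shape (3, 84, 128)
```

The stacked tensor can then be fed to a 2-D CNN exactly like a 3-channel image, which is what makes channel attention over the three feature types meaningful.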
2.4. Model
2.4.1. Baseline Architecture
2.4.2. Attention Mechanism
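Section 2.4.2 combines channel attention and spatial attention over the stacked feature maps. Below is a minimal NumPy sketch of the two gating steps in a CBAM-style design; the weight matrices `w1`/`w2` and the two-coefficient mix `k` are hypothetical stand-ins (CBAM proper uses a shared MLP plus a 7×7 convolution for the spatial branch), not the authors' exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Gate each channel of x (C, H, W) using avg- and max-pooled descriptors.

    w1: (C, C//r) and w2: (C//r, C) form a shared bottleneck MLP.
    """
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    gate = sigmoid(np.maximum(avg @ w1, 0) @ w2
                   + np.maximum(mx @ w1, 0) @ w2)   # (C,)
    return x * gate[:, None, None]

def spatial_attention(x, k):
    """Gate each (H, W) position by mixing channel-wise avg and max maps.

    k: two scalars; a simplified stand-in for CBAM's 7x7 convolution.
    """
    avg = x.mean(axis=0)                            # (H, W)
    mx = x.max(axis=0)                              # (H, W)
    gate = sigmoid(k[0] * avg + k[1] * mx)          # (H, W)
    return x * gate[None, :, :]
```

Applied in sequence, the channel gate reweights the MFCC/CQT/chroma channels while the spatial gate highlights key time-frequency regions, which matches the stated goal of emphasizing important frequency bands.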
3. Experiment and Result Analysis
3.1. Experimental Setup
3.2. Performance Evaluation
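The tables in Section 3 report accuracy, recall, precision, and F1 score. A sketch of how such macro-averaged metrics follow from a confusion matrix (rows = true class, columns = predicted class); the toy matrix below is illustrative, not data from the paper.

```python
import numpy as np

def macro_metrics(cm):
    """Accuracy plus macro-averaged recall, precision, and F1 from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # per-class recall
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # per-class precision
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, recall.mean(), precision.mean(), f1.mean()

cm = np.array([[50, 2, 0],
               [3, 45, 2],
               [1, 1, 48]])
acc, rec, prec, f1 = macro_metrics(cm)
```

Macro averaging weights every instrument class equally, which is the usual choice when per-class sample counts differ across the merged datasets.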
3.3. Results and Analysis
3.3.1. Comparison of Baseline Using Single Features or Fused Features
3.3.2. Comparison of Classic Models Using Stacking Features
3.3.3. Ablation Study
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Comparison of the baseline using single features or fused features:

| Input | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| MFCC | 0.9258 | 0.9256 | 0.9251 | 0.9286 |
| CQT | 0.9198 | 0.9198 | 0.9273 | 0.9193 |
| Chroma | 0.7996 | 0.7894 | 0.7974 | 0.7922 |
| MFCC&CQT&Chroma | 0.9417 | 0.9397 | 0.9392 | 0.9366 |
| Stacking | 0.9591 | 0.9562 | 0.9615 | 0.9590 |
Comparison of classic models using stacking features:

| Model | Accuracy | Recall | Precision | F1 Score | Time (s) |
|---|---|---|---|---|---|
| Ours | 0.9879 | 0.9858 | 0.9884 | 0.9859 | 2566.54 |
| ResNet18 | 0.9278 | 0.9378 | 0.9266 | 0.9207 | 2396.88 |
| ResNet50 | 0.9740 | 0.9749 | 0.9748 | 0.9733 | 3177.17 |
| VGG16 | 0.9820 | 0.9815 | 0.9813 | 0.9813 | 4379.65 |
| CNN | 0.9399 | 0.9352 | 0.9274 | 0.9306 | 2025.73 |
| DenseNet | 0.9771 | 0.9757 | 0.9768 | 0.9782 | 3365.25 |
| ViT | 0.9798 | 0.9779 | 0.9783 | 0.9775 | 4534.31 |
Ablation study over components S1 and S2:

| Case | S1 | S2 | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|---|---|
| 1 | × | × | 0.9591 | 0.9562 | 0.9615 | 0.9590 |
| 2 | ✔ | × | 0.9698 | 0.9708 | 0.9674 | 0.9699 |
| 3 | × | ✔ | 0.9737 | 0.9719 | 0.9783 | 0.9759 |
| 4 | ✔ | ✔ | 0.9879 | 0.9858 | 0.9884 | 0.9859 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yang, J.; Gao, F.; Yun, T.; Zhu, T.; Zhu, H.; Zhou, R.; Wang, Y. A Deep-Learning Framework with Multi-Feature Fusion and Attention Mechanism for Classification of Chinese Traditional Instruments. Electronics 2025, 14, 2805. https://doi.org/10.3390/electronics14142805