Search Results (12)

Search Parameters: Keywords = GTZAN

44 pages, 12058 KiB  
Article
Harmonizer: A Universal Signal Tokenization Framework for Multimodal Large Language Models
by Amin Amiri, Alireza Ghaffarnia, Nafiseh Ghaffar Nia, Dalei Wu and Yu Liang
Mathematics 2025, 13(11), 1819; https://doi.org/10.3390/math13111819 - 29 May 2025
Viewed by 1292
Abstract
This paper introduces Harmonizer, a universal framework designed for tokenizing heterogeneous input signals, including text, audio, and video, to enable seamless integration into multimodal large language models (LLMs). Harmonizer employs a unified approach to convert diverse, non-linguistic signals into discrete tokens via its FusionQuantizer architecture, built on FluxFormer, to efficiently capture essential signal features while minimizing complexity. We enhance features through STFT-based spectral decomposition, Hilbert transform analytic signal extraction, and SCLAHE spectrogram contrast optimization, and train using a composite loss function to produce reliable embeddings and construct a robust vector vocabulary. Experimental validation on music datasets such as E-GMD v1.0.0, Maestro v3.0.0, and GTZAN demonstrates high fidelity across 288 s of vocal signals (MSE = 0.0037, CC = 0.9282, Cosine Sim. = 0.9278, DTW = 12.12, MFCC Sim. = 0.9997, Spectral Conv. = 0.2485). Preliminary tests on text reconstruction and UCF-101 video clips further confirm Harmonizer’s applicability across discrete and spatiotemporal modalities. Rooted in the universality of wave phenomena and Fourier theory, Harmonizer offers a physics-inspired, modality-agnostic fusion mechanism via wave superposition and interference principles. In summary, Harmonizer integrates natural language processing and signal processing into a coherent tokenization paradigm for efficient, interpretable multimodal learning.
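
The signal-analysis steps named in the abstract (STFT-based spectral decomposition and Hilbert-transform analytic-signal extraction) are standard operations; a minimal Python sketch of them follows. The FusionQuantizer, FluxFormer, and SCLAHE components are the paper's own and are not reproduced here.

```python
# Illustrative sketch of STFT decomposition and analytic-signal extraction;
# not the authors' FusionQuantizer pipeline.
import numpy as np
from scipy.signal import stft, hilbert

def spectral_features(x: np.ndarray, fs: int = 22050):
    """Return an STFT magnitude spectrogram plus the analytic-signal envelope and phase."""
    f, t, Z = stft(x, fs=fs, nperseg=1024, noverlap=512)  # complex STFT
    magnitude = np.abs(Z)                  # spectral decomposition
    analytic = hilbert(x)                  # analytic signal via Hilbert transform
    envelope = np.abs(analytic)            # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic))  # instantaneous phase
    return magnitude, envelope, phase

# Example: one second of a 440 Hz tone
x = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 22050, endpoint=False))
mag, env, phase = spectral_features(x)
print(mag.shape, env.shape)
```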

16 pages, 1401 KiB  
Article
Efficient Music Genre Recognition Using ECAS-CNN: A Novel Channel-Aware Neural Network Architecture
by Yang Ding, Hongzheng Zhang, Wanmacairang Huang, Xiaoxiong Zhou and Zhihan Shi
Sensors 2024, 24(21), 7021; https://doi.org/10.3390/s24217021 - 31 Oct 2024
Cited by 1 | Viewed by 2288
Abstract
In the era of digital music proliferation, music genre classification has become a crucial task in music information retrieval. This paper proposes a novel channel-aware convolutional neural network (ECAS-CNN) designed to enhance the efficiency and accuracy of music genre recognition. By integrating an adaptive channel attention mechanism (ECA module) within the convolutional layers, the network significantly improves the extraction of key musical features. Extensive experiments were conducted on the GTZAN dataset, comparing the proposed ECAS-CNN with traditional convolutional neural networks. The results demonstrate that ECAS-CNN outperforms conventional methods across various performance metrics, including accuracy, precision, recall, and F1-score, particularly in handling complex musical features. This study validates the potential of ECAS-CNN in the domain of music genre classification and offers new insights for future research and applications.
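
The ECA module the abstract refers to is a published channel-attention block: squeeze each channel by global average pooling, then run a small 1D convolution across channels to produce per-channel gates. A minimal PyTorch sketch is below; the kernel size and placement inside ECAS-CNN are assumptions, not the paper's exact configuration.

```python
# Minimal Efficient Channel Attention (ECA) block.
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)         # squeeze: one descriptor per channel
        self.conv = nn.Conv1d(1, 1, kernel_size=k,  # local cross-channel interaction
                              padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = self.pool(x).view(b, 1, c)              # (B, 1, C) channel descriptor
        y = self.sigmoid(self.conv(y)).view(b, c, 1, 1)
        return x * y                                # reweight feature maps channel-wise

feats = torch.randn(8, 64, 32, 32)   # e.g., CNN features of a mel spectrogram
print(ECABlock(k=3)(feats).shape)    # torch.Size([8, 64, 32, 32])
```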

13 pages, 891 KiB  
Article
A Hybrid Parallel Computing Architecture Based on CNN and Transformer for Music Genre Classification
by Jiyang Chen, Xiaohong Ma, Shikuan Li, Sile Ma, Zhizheng Zhang and Xiaojing Ma
Electronics 2024, 13(16), 3313; https://doi.org/10.3390/electronics13163313 - 21 Aug 2024
Cited by 4 | Viewed by 2725
Abstract
Music genre classification (MGC) is the basis for the efficient organization, retrieval, and recommendation of music resources, so it has important research value. Convolutional neural networks (CNNs) have been widely used in MGC and achieved excellent results. However, CNNs cannot model global features well due to the influence of the local receptive field; these global features are crucial for classifying music signals with temporal properties. Transformers can capture long-range dependencies within an image thanks to adopting the self-attention mechanism. Nevertheless, there are still performance and computational cost gaps between Transformers and existing CNNs. In this paper, we propose a hybrid architecture (CNN-TE) based on CNN and Transformer encoder for MGC. Specifically, we convert the audio signals into mel spectrograms and feed them into a hybrid model for training. Our model employs a CNN to initially capture low-level and localized features from the spectrogram. Subsequently, these features are processed by a Transformer encoder, which models them globally to extract high-level and abstract semantic information. This refined information is then classified using a multi-layer perceptron. Our experiments demonstrate that this approach surpasses many existing CNN architectures when tested on the GTZAN and FMA datasets. Notably, it achieves these results with fewer parameters and a faster inference speed.
(This article belongs to the Special Issue Recent Advances of Cloud, Edge, and Parallel Computing)
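
A condensed sketch of the pipeline the abstract describes, i.e., a CNN front-end feeding a Transformer encoder whose pooled output goes to a classifier head. All layer sizes, head counts, and depths below are placeholders, not the paper's CNN-TE configuration.

```python
# Hybrid CNN -> Transformer-encoder classifier over mel spectrograms (sketch).
import torch
import torch.nn as nn

class CNNTE(nn.Module):
    def __init__(self, n_classes: int = 10, d_model: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                   # local, low-level features
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, d_model, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 64)),          # collapse frequency, keep 64 time steps
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # global context
        self.head = nn.Linear(d_model, n_classes)   # classifier head

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        z = self.cnn(mel)                  # (B, d_model, 1, T')
        z = z.squeeze(2).transpose(1, 2)   # (B, T', d_model) token sequence
        z = self.encoder(z).mean(dim=1)    # pool over time
        return self.head(z)

mel = torch.randn(4, 1, 128, 256)          # batch of single-channel mel spectrograms
print(CNNTE()(mel).shape)                  # torch.Size([4, 10])
```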

24 pages, 9531 KiB  
Article
Music Genre Classification Based on VMD-IWOA-XGBOOST
by Rumeijiang Gan, Tichen Huang, Jin Shao and Fuyu Wang
Mathematics 2024, 12(10), 1549; https://doi.org/10.3390/math12101549 - 15 May 2024
Cited by 3 | Viewed by 2132
Abstract
Music genre classification is significant to users and digital platforms. To enhance the classification accuracy, this study proposes a hybrid model based on VMD-IWOA-XGBOOST for music genre classification. First, the audio signals are transformed into numerical or symbolic data, and the crucial features are selected using the maximal information coefficient (MIC) method. Second, an improved whale optimization algorithm (IWOA) is proposed for parameter optimization. Third, the inner patterns of these selected features are extracted by IWOA-optimized variational mode decomposition (VMD). Lastly, all features are put into the IWOA-optimized extreme gradient boosting (XGBOOST) classifier. To verify the effectiveness of the proposed model, two open music datasets are used, i.e., GTZAN and Bangla. The experimental results illustrate that the proposed hybrid model achieves better performance than the other models in terms of five evaluation criteria.
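
A hedged sketch of the final classification stage only: fitting an XGBoost classifier on a precomputed feature matrix. The MIC feature selection, VMD decomposition, and IWOA hyperparameter search are not reproduced here; fixed hyperparameters and random stand-in features are used for illustration.

```python
# XGBoost genre classifier on a stand-in feature matrix (sketch).
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))     # stand-in features (e.g., MFCC/VMD statistics)
y = rng.integers(0, 10, size=1000)  # ten genre labels, as in GTZAN

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.1,  # would come from IWOA search
    objective="multi:softprob",
)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```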

13 pages, 385 KiB  
Article
Attributes Relevance in Content-Based Music Recommendation System
by Daniel Kostrzewa, Jonatan Chrobak and Robert Brzeski
Appl. Sci. 2024, 14(2), 855; https://doi.org/10.3390/app14020855 - 19 Jan 2024
Cited by 9 | Viewed by 3192
Abstract
With millions of users and songs in online databases, automatic music recommendation is increasingly necessary, and effective methods to perform it automatically need to be created. In this paper, the task is solved using three basic factors: genre classification by a neural network, Mel-frequency cepstral coefficients (MFCCs), and the tempo of the song. The recommendation system is built on a probability function combining these three factors. The authors' contribution to the development of an automatic content-based recommendation system is a set of methods built from these three factors; combining them in different ways yields four strategies. All four strategies were evaluated on the feedback of 37 users, who completed a total of 300 surveys. The proposed recommendation methods show a definite improvement over a random baseline, and the results indicate that the MFCC parameters have the greatest impact on recommendation quality.
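
A simplified sketch of how the three factors might be combined to score a candidate track against a seed track. The descriptor extraction uses librosa, the weights are hypothetical (not the authors' probability function), and genre_match is assumed to come from a separately trained classifier.

```python
# Blend MFCC similarity, tempo similarity, and a genre-match score (sketch).
import numpy as np
import librosa

def track_descriptors(path: str):
    """MFCC timbre summary and global tempo for one audio file."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # (13,) timbre vector
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)                   # tempo in BPM
    return mfcc, float(tempo)

def recommendation_score(seed: str, candidate: str, genre_match: float) -> float:
    """genre_match in [0, 1] is assumed to come from a genre classifier."""
    m1, t1 = track_descriptors(seed)
    m2, t2 = track_descriptors(candidate)
    mfcc_sim = float(np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2)))
    tempo_sim = 1.0 - min(abs(t1 - t2) / max(t1, t2), 1.0)
    # Hypothetical weights; the paper finds MFCCs to matter most.
    return 0.5 * mfcc_sim + 0.2 * tempo_sim + 0.3 * genre_match
```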

15 pages, 1021 KiB  
Article
Comparative Analysis of Deep Learning Architectures and Vision Transformers for Musical Key Estimation
by Manav Garg, Pranshav Gajjar, Pooja Shah, Madhu Shukla, Biswaranjan Acharya, Vassilis C. Gerogiannis and Andreas Kanavos
Information 2023, 14(10), 527; https://doi.org/10.3390/info14100527 - 28 Sep 2023
Cited by 10 | Viewed by 3896
Abstract
The musical key serves as a crucial element in a piece, offering vital insights into the tonal center, harmonic structure, and chord progressions while enabling tasks such as transposition and arrangement. Moreover, accurate key estimation finds practical applications in music recommendation systems and automatic music transcription, making it relevant across academic and industrial domains. This paper presents a comprehensive comparison between standard deep learning architectures and emerging vision transformers, leveraging their success in various domains. We evaluate the performance of six deep learning models on a specific subset of the GTZAN dataset. Our results demonstrate that DenseNet, a conventional deep learning architecture, achieves a remarkable accuracy of 91.64%, outperforming the vision transformers. However, a deeper analysis of the temporal characteristics of each model shows that the vision transformer and SWIN transformer, despite a slight decrease in overall accuracy (1.82% and 2.29%, respectively), outperform the DenseNet architecture on temporal metrics. The significance of our findings lies in their contribution to the field of musical key estimation, where accurate and efficient algorithms play a pivotal role. By examining the strengths and weaknesses of deep learning architectures and vision transformers, we provide insights for practical implementations, particularly in music recommendation systems and automatic music transcription, and a foundation for future work in this area.
(This article belongs to the Topic Advances in Artificial Neural Networks)
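
A minimal transfer-learning sketch in the spirit of the best-performing model in the abstract: an ImageNet-pretrained DenseNet re-headed for key classification. The 24-class major/minor key head and the image-style input pipeline are assumptions, not the paper's exact setup.

```python
# DenseNet re-headed for musical key classification (sketch).
import torch
import torch.nn as nn
from torchvision import models

model = models.densenet121(weights="IMAGENET1K_V1")             # pretrained backbone
model.classifier = nn.Linear(model.classifier.in_features, 24)  # assumed 24 keys

spec = torch.randn(2, 3, 224, 224)   # spectrograms rendered as 3-channel images
logits = model(spec)
print(logits.shape)                  # torch.Size([2, 24])
```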

11 pages, 710 KiB  
Article
Locally Activated Gated Neural Network for Automatic Music Genre Classification
by Zhiwei Liu, Ting Bian and Minglai Yang
Appl. Sci. 2023, 13(8), 5010; https://doi.org/10.3390/app13085010 - 17 Apr 2023
Cited by 12 | Viewed by 2652
Abstract
Automatic music genre classification is a prevailing pattern recognition task, and many algorithms have been proposed for accurate classification. Because the genre of music is a very broad concept, even music within the same genre can differ significantly, yet current methods have paid little attention to these large intra-class differences. This paper presents a novel approach to address this issue, using a locally activated gated neural network (LGNet). By incorporating multiple locally activated multi-layer perceptrons and a gated routing network, LGNet adaptively employs different network layers as multi-learners to learn from music signals with diverse characteristics. Our experimental results demonstrate that LGNet significantly outperforms existing methods for music genre classification, achieving superior performance on the filtered GTZAN dataset.
(This article belongs to the Special Issue Artificial Intelligence in Audio and Music)
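
A toy sketch of the gating idea the abstract describes: a routing network softly assigns each input to several MLP "learners" and mixes their outputs. The expert count and layer sizes are illustrative, not LGNet's actual architecture.

```python
# Soft gated routing over multiple MLP experts (sketch).
import torch
import torch.nn as nn

class GatedMLPs(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, n_classes: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, n_classes))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_in, n_experts)   # routing network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(x), dim=-1)              # (B, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], 1)  # (B, E, C) expert outputs
        return (w.unsqueeze(-1) * outs).sum(dim=1)           # weighted mixture

x = torch.randn(16, 128)                 # e.g., pooled audio embeddings
print(GatedMLPs(128, 64, 10)(x).shape)   # torch.Size([16, 10])
```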

14 pages, 483 KiB  
Article
An Efficient Hidden Markov Model with Periodic Recurrent Neural Network Observer for Music Beat Tracking
by Guangxiao Song and Zhijie Wang
Electronics 2022, 11(24), 4186; https://doi.org/10.3390/electronics11244186 - 14 Dec 2022
Cited by 8 | Viewed by 2754
Abstract
In music information retrieval (MIR), beat tracking is one of the most fundamental tasks. To extract this critical component from rhythmic music signals, earlier beat tracking systems combined a hidden Markov model (HMM) with a recurrent neural network (RNN) observer. Although the frequency of the music beat is quite stable, existing HMM-based methods do not exploit this property; consequently, most of their hidden states are redundant, which hurts time efficiency. In this paper, we propose an efficient HMM that uses fewer hidden states by extracting the frequency content of the neural network's observation with the Fourier transform, which greatly reduces the computational complexity. Observers used in previous work, such as the bi-directional recurrent neural network (Bi-RNN) and the temporal convolutional network (TCN), cannot perceive the frequency of the music beat. To obtain more reliable frequencies from music, we also propose an attention-based periodic recurrent neural network (PRNN), which serves as the observer in the HMM. Experimental results on open-source music datasets, including GTZAN, Hainsworth, SMC, and Ballroom, show that our efficient HMM with a PRNN observer is competitive with state-of-the-art methods at lower computational cost.
(This article belongs to the Topic Machine and Deep Learning)
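
The key observation, that a stable beat appears as a peak in the Fourier transform of a beat-activation envelope, can be illustrated in a few lines of Python. Here librosa's onset-strength envelope stands in for the paper's PRNN observer, and no HMM decoding is shown; librosa.ex downloads a bundled example clip on first use.

```python
# Estimate tempo from the spectrum of an onset-strength envelope (sketch).
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))   # bundled example clip
hop = 512
onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)

frame_rate = sr / hop                          # envelope sample rate (Hz)
spectrum = np.abs(np.fft.rfft(onset_env - onset_env.mean()))
freqs = np.fft.rfftfreq(len(onset_env), d=1.0 / frame_rate)

band = (freqs >= 0.5) & (freqs <= 4.0)         # 30-240 BPM search band
beat_hz = freqs[band][np.argmax(spectrum[band])]
print(f"estimated tempo: {60 * beat_hz:.1f} BPM")
```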

31 pages, 1653 KiB  
Article
Large-Scale Music Genre Analysis and Classification Using Machine Learning with Apache Spark
by Mousumi Chaudhury, Amin Karami and Mustansar Ali Ghazanfar
Electronics 2022, 11(16), 2567; https://doi.org/10.3390/electronics11162567 - 17 Aug 2022
Cited by 24 | Viewed by 11327
Abstract
The trend of listening to music online has grown greatly over the past decade due to the number of musical tracks available online. The large music libraries provided by online music content distribution vendors make streaming and downloading services more accessible to the end-user, so it is essential to tag or index similar songs by genre so that they can be presented conveniently. As online music listening continues to grow, developing machine learning models to classify music genres has become a major area of research. In this paper, the popular GTZAN music dataset, which contains ten genres, is analysed to study various types of music features and audio signals. Multiple scalable machine learning algorithms supported by Apache Spark, including naïve Bayes, decision tree, logistic regression, and random forest, are investigated for the classification of music genres. The performance of these classifiers is compared, and the random forest performs best. Apache Spark is used to reduce the computation time of the machine learning predictions at no additional cost, as it parallelizes the computation; the present work demonstrates that combining Apache Spark with these algorithms mitigates the scalability problem of computing predictions. Moreover, the hyperparameters of the random forest classifier are optimized to increase its performance in music genre classification. The experimental outcome shows that the developed random forest classifier attains a high level of accuracy even on the mislabelled, distorted GTZAN dataset, outperforming the other Apache Spark classifiers in the present work and achieving 90% accuracy for music genre classification, which compares well with other work in the same domain.
(This article belongs to the Special Issue Big Data Technologies: Explorations and Analytics)
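
A minimal PySpark sketch of the kind of pipeline the abstract describes: indexing genre labels, assembling precomputed audio features, and fitting a random forest. The input file gtzan_features.csv is hypothetical, and the hyperparameters are not the paper's tuned values.

```python
# Spark ML pipeline: label indexing, feature assembly, random forest (sketch).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("gtzan-genres").getOrCreate()
df = spark.read.csv("gtzan_features.csv", header=True, inferSchema=True)  # hypothetical file

feature_cols = [c for c in df.columns if c != "genre"]
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="genre", outputCol="label"),
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100),
])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)          # training is parallelized across the cluster
predictions = model.transform(test)
```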

19 pages, 2467 KiB  
Article
A Middle-Level Learning Feature Interaction Method with Deep Learning for Multi-Feature Music Genre Classification
by Jinliang Liu, Changhui Wang and Lijuan Zha
Electronics 2021, 10(18), 2206; https://doi.org/10.3390/electronics10182206 - 9 Sep 2021
Cited by 13 | Viewed by 4131
Abstract
Music genre classification is an increasingly interesting area that is attracting considerable research attention, and multi-feature models are acknowledged as a desirable way to realize the classification. However, the major branches of the multi-feature models used in most existing works are relatively independent and do not interact, which results in insufficient learning features for music genre classification. In view of this, we examine how learning-feature interaction among different branches and layers of a multi-feature model affects the final classification results, and we propose a corresponding middle-level learning-feature interaction method based on deep learning. Our experimental results show that the designed method significantly improves the accuracy of music genre classification, reaching a best accuracy of 93.65% on the GTZAN dataset, which is superior to most current methods.

17 pages, 4631 KiB  
Article
Revisiting Label Smoothing Regularization with Knowledge Distillation
by Jiyue Wang, Pei Zhang, Qianhua He, Yanxiong Li and Yongjian Hu
Appl. Sci. 2021, 11(10), 4699; https://doi.org/10.3390/app11104699 - 20 May 2021
Cited by 11 | Viewed by 5377
Abstract
Label Smoothing Regularization (LSR) is a widely used tool for generalizing classification models by replacing the one-hot ground truth with smoothed labels. Recent research on LSR has increasingly focused on its correlation with Knowledge Distillation (KD), which transfers knowledge from a teacher model to a lightweight student model by penalizing the Kullback–Leibler divergence between their outputs. Based on this observation, a Teacher-free Knowledge Distillation (Tf-KD) method was proposed in previous work: instead of a real teacher model, a handcrafted distribution similar to LSR is used to guide student learning. Tf-KD is a promising substitute for LSR, except that its hyperparameters are hard to tune and model-dependent. This paper develops a new teacher-free framework, LSR-OS-TC, which decomposes the Tf-KD method into two components: model Output Smoothing (OS) and Teacher Correction (TC). First, LSR-OS extends the LSR method to the KD regime by applying a softer temperature to the model's output softmax layer; output smoothing is critical for stabilizing the KD hyperparameters across different models. Second, in the TC component, a larger share of probability mass is assigned to the correct class of the uniform teacher distribution to provide a more informative teacher. The two-component method was evaluated exhaustively on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (GTZAN) classification tasks. The results show that LSR-OS improves on LSR independently, with no extra computational cost, especially on several deep neural networks where LSR is ineffective, and the further training boost from the TC component confirms the effectiveness of the two-component strategy. Overall, LSR-OS-TC is a practical substitute for LSR that, unlike the original Tf-KD method, can be tuned on one model and applied directly to others.
(This article belongs to the Section Computing and Artificial Intelligence)
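
The two ingredients the framework builds on are easy to state concretely: label smoothing replaces a one-hot target with q_i = (1 − ε)·1[i = y] + ε/K over K classes, and distillation penalizes the KL divergence between temperature-softened outputs. A short PyTorch sketch follows; the TC step is only gestured at via a large true-class share in the handcrafted teacher, with a weight that is not the paper's exact formulation.

```python
# Label smoothing targets and a KD-style KL loss against a soft teacher (sketch).
import torch
import torch.nn.functional as F

def smoothed_targets(y: torch.Tensor, K: int, eps: float = 0.1) -> torch.Tensor:
    """q_i = (1 - eps) * 1[i == y] + eps / K."""
    q = torch.full((y.size(0), K), eps / K)
    q.scatter_(1, y.unsqueeze(1), 1.0 - eps + eps / K)  # mass on the true class
    return q

def kd_loss(student_logits: torch.Tensor, teacher_probs: torch.Tensor, T: float = 4.0):
    """KL(teacher || student) at temperature T, scaled by T^2 as in standard KD."""
    log_p = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p, teacher_probs, reduction="batchmean") * T * T

y = torch.tensor([3, 7])
# Near-uniform handcrafted "teacher"; giving the true class a larger share
# than uniform is the Teacher Correction idea (eps here is illustrative).
teacher = smoothed_targets(y, K=10, eps=0.9)
logits = torch.randn(2, 10, requires_grad=True)
print(kd_loss(logits, teacher).item())
```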

13 pages, 2477 KiB  
Article
Self-Supervised Transfer Learning from Natural Images for Sound Classification
by Sungho Shin, Jongwon Kim, Yeonguk Yu, Seongju Lee and Kyoobin Lee
Appl. Sci. 2021, 11(7), 3043; https://doi.org/10.3390/app11073043 - 29 Mar 2021
Cited by 14 | Viewed by 4451
Abstract
We propose transfer learning from natural images to audio-based images using self-supervised learning schemes. Through self-supervised learning, convolutional neural networks (CNNs) can learn general representations of natural images without labels. In this study, a convolutional neural network was pre-trained on natural images (ImageNet) via self-supervised learning and subsequently fine-tuned on the target audio samples. Pre-training with the self-supervised learning scheme significantly improved sound classification performance when validated on the ESC-50, UrbanSound8k, and GTZAN benchmarks. The network pre-trained via self-supervised learning achieved a similar level of accuracy to networks pre-trained with a supervised method that requires labels. We therefore demonstrate that transfer learning from natural images contributes to improvements in audio-related tasks, and that self-supervised learning with natural images is an adequate pre-training scheme in terms of simplicity and effectiveness.
(This article belongs to the Section Computing and Artificial Intelligence)
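
A sketch of the transfer recipe the abstract evaluates: take a CNN pre-trained on natural images and fine-tune it on spectrograms rendered as images. torchvision's supervised ImageNet weights stand in here for the paper's self-supervised pre-training, and the 10-class GTZAN head is an assumption.

```python
# Fine-tune an ImageNet-pretrained CNN on spectrogram "images" (sketch).
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")    # pretrained on natural images
backbone.fc = nn.Linear(backbone.fc.in_features, 10)   # e.g., 10 GTZAN genres

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

specs = torch.randn(8, 3, 224, 224)    # spectrograms tiled to 3 channels
labels = torch.randint(0, 10, (8,))
loss = criterion(backbone(specs), labels)  # one fine-tuning step
loss.backward()
optimizer.step()
print(float(loss))
```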
