Review

From Tools to Creators: A Review on the Development and Application of Artificial Intelligence Music Generation

1 School of Music, Neijiang Normal University, Neijiang 641100, China
2 School of Artificial Intelligence, Neijiang Normal University, Neijiang 641100, China
* Author to whom correspondence should be addressed.
Information 2025, 16(8), 656; https://doi.org/10.3390/info16080656
Submission received: 14 May 2025 / Revised: 27 July 2025 / Accepted: 28 July 2025 / Published: 31 July 2025
(This article belongs to the Special Issue Text-to-Speech and AI Music)

Abstract

Artificial intelligence (AI) has emerged as a significant driving force in the development of technology and industry. Applied to music generation and analysis, it has given rise to music AI, a field whose roots lie in the algorithmic composition techniques of the mid-20th century. Recent advances in machine learning and neural networks have enabled innovative approaches to music generation and exploration. This article surveys the development history and technical routes of music AI and analyzes its current status and limitations across several areas, including music generation and composition, rehabilitation and therapy, and education and learning. The survey shows that music AI has become a promising creator in the field of music generation. The influence of music AI on the music industry and the challenges it encounters are also explored, and an emotional music generation system driven by multimodal signals is proposed. Although music AI technology still needs further improvement, continued technological breakthroughs will give it an increasingly profound impact on all areas of music.

1. Introduction

Music is an artistic medium that brings people together and allows for the expression of emotions. It documents the evolution of society and cultural heritage, playing a unique and essential role in conveying feelings [1]. Traditionally, the creation and performance of music rely on the creativity, skills, and experience of musicians. With the rapid development of AI, the approaches to music generation and expression have also expanded. Music AI refers to systems or tools that apply AI technology to the creation, performance, analysis, and understanding of music [2]. It combines knowledge from multiple disciplines such as musicology, psychology, computer science, and information engineering. Music AI has not only promoted the rapid development of automatic music generation and made music performance more precise [3], but has also played a positive role in music recommendation [4], appreciation [5], and therapy [6,7]. In China, the Central Conservatory of Music and the China Artificial Intelligence Society held the 1st and 2nd Summit on Music Intelligence (SOMI) in 2021 and 2023, respectively [8]. Music AI has thus shifted from a mere computational tool to an innovative participant in music generation.
In the early stages, music AI was limited to algorithmic composition and sound synthesis, relying on rule-based systems to generate musical patterns. David Cope’s Experiments in Musical Intelligence (EMI) in the 1980s demonstrated that music AI could emulate the styles of classical composers with rule-based programming and pattern recognition [9]. The advent of machine learning and neural networks has enabled music AI to evolve beyond its role as a mere tool, becoming an active participant in the creative process. Platforms like Amper Music and AIVA (Artificial Intelligence Virtual Artist) allow users to generate music by inputting parameters such as genre, mood, and tempo, resulting in customized compositions [3]. Similarly, Google’s Magenta and OpenAI’s MuseNet demonstrated polyphonic composition across genres, blending classical and jazz elements with minimal human input [10,11]. These advancements have promoted music AI from a passive assistant to an innovative partner. A critical milestone in this evolution is the shift toward human–AI collaboration, in which AI plays an active role in the creative process [12]. Music AI has been widely adopted in industries such as film scoring, where tight deadlines demand high-quality output and AI can provide efficient solutions. It has also entered the field of independent music creation: in 2016, Sony’s Paris Computer Science Laboratory introduced original pop songs in various styles composed entirely by AI, indicating the growing independence of music AI as a creative tool [13].
To better understand the evolving collaborative role of music AI in music generation, this paper first surveys the development of and research progress in AI-based music generation. The effects on the music industry, the challenges faced, and potential solutions across various applications are also discussed.

2. Music Generation

In computer information processing, music can be described as stream data that conveys different cognitive, emotional, social, and physiological functions [14]. Music generation is a representative and principal application of music AI. Such systems can learn and innovate autonomously, analyzing large amounts of music data and patterns to generate new music representations. Music representations are used to encode, store, and manipulate musical information, such as pitch, rhythm, and harmony, in a way that computers can process. In general, music data may be represented in several ways, including audio, symbolic, piano-roll, and sheet-music representations. Researchers have been trying to generate music with algorithms since the invention of the first computer, ENIAC, at the University of Pennsylvania in 1946. Initial efforts were based on rule-based systems and probabilistic approaches; later, the advent of deep learning and evolutionary computation revolutionized this area. In recent years, the incorporation of advanced models such as Generative Adversarial Networks (GANs), Transformers, and diffusion models has enhanced the quality of generated music. Furthermore, hybrid frameworks combining symbolic and audio generation methods strengthen the expression and structure of music compositions. The development roadmap and representative algorithm platforms of music generation technology are shown in Figure 1.
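As a concrete illustration of the symbolic and piano-roll representations mentioned above, the following minimal Python sketch encodes a short monophonic melody in both forms; the pitch numbers follow the MIDI convention (60 = middle C), and the 16th-note time grid is an illustrative assumption rather than a standard used by any particular system.

import numpy as np

# Event-list (symbolic) representation: (MIDI pitch, start step, duration in steps)
melody_events = [(60, 0, 4), (62, 4, 4), (64, 8, 4), (67, 12, 4)]  # C, D, E, G

# Piano-roll representation: binary matrix of shape (128 pitches, time steps)
n_steps = 16
piano_roll = np.zeros((128, n_steps), dtype=np.uint8)
for pitch, start, dur in melody_events:
    piano_roll[pitch, start:start + dur] = 1

# Each column now marks which pitches sound on that 16th-note step
print(piano_roll[58:70])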

2.1. Rule-Based Method

Rule-based music generation is the process of generating music based on a set of predefined music theory rules [53]. The rules are derived from music notation, composition principles, harmony, and other music construction methods [54]. The computer follows these rules to ensure that the generated music is structurally, harmonically, melodically, and rhythmically reasonable. In 1957, Max Mathews wrote the Music I software program using an IBM 704 computer (International Business Machines Corporation, New York, NY, USA) with a 12-bit vacuum tube digital-to-analog converter at Bell Labs and successfully played 17 s of music, becoming one of the pioneers of computer-generated music [3,55]. In the same year, Lejaren Hiller and Leonard Isaacson co-created the world’s first musical work composed entirely by a computer, the Illiac Suite, which consists of four movements. The first three movements were generated by rule-based algorithms: generating fixed-melody music of different lengths, generating four-part music using variation rules, and generating music according to different rhythm, dynamics, and performance rules. The Illiac Suite is a milestone in the combination of music with computer science because it confirmed that note movements and leaps can be produced algorithmically.
The rule-based method for music generation has been combined with generative grammar theory in its development. This theory, proposed by Noam Chomsky in 1957, suggests that a finite set of rules can generate an infinite number of linguistic structures [56]. By using simple grammatical rules, users can produce a variety of sentences and express complex ideas. Rule-based music generation follows a comparable principle. In 1968, building on this theory, Aristid Lindenmayer introduced a self-rewriting mathematical structure called the L-system, which is frequently used in computational biology and computer graphics [57]. L-systems have been used in music generation because they can control both the details and the overall structure of music by iteratively generating rhythm sequences. In 1986, Przemyslaw Prusinkiewicz interpreted a string of symbols generated by L-systems as musical notes, creating a method for L-system composition [58]. In 1994, Hanspeter Kyburz created the saxophone ensemble piece “Cell” using an L-system to generate the rules of derivation, one of the early achievements of L-system music composition [16]. In 2012, Tanaka Tsubasa et al. proposed a method for generating melodic grammar rules based on polyphonic music: a polyphony model was first defined, and then the rules of that model were rewritten using L-systems and generative grammar to generate polyphonic music with characteristic melodies [17].
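To make the rewriting idea concrete, the sketch below expands a toy L-system and maps the resulting symbol string onto a C-major scale; the rules and the pitch interpretation are illustrative assumptions rather than Prusinkiewicz’s or Kyburz’s actual mappings.

# Hypothetical rewriting rules and axiom (illustrative only)
rules = {"A": "AB", "B": "A-"}
axiom = "A"

def expand(symbols, iterations):
    for _ in range(iterations):
        symbols = "".join(rules.get(ch, ch) for ch in symbols)
    return symbols

string = expand(axiom, 5)

# Interpret symbols as movement on a C-major scale: "A" steps up, "-" steps down,
# "B" repeats the current scale degree
scale = [60, 62, 64, 65, 67, 69, 71, 72]   # MIDI pitches of one C-major octave
idx, melody = 0, []
for ch in string:
    if ch == "A":
        idx = min(idx + 1, len(scale) - 1)
    elif ch == "-":
        idx = max(idx - 1, 0)
    melody.append(scale[idx])
print(melody)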
A landmark of rule-based music generation platforms is Experiments in Musical Intelligence (EMI), invented by David Cope in 1989 [9]. By analyzing the musical structure, harmony and melody development of a large number of composers, EMI imitates these characteristics to generate corresponding rules to simulate the composer’s style and realize automatic creation. EMI has successfully generated music works in the style of composers such as Bach, Mozart, Beethoven, and Chopin with subtle differences. EMI indicates that AI composition gradually acquires the ability to generate music works under human instructions and acts as an auxiliary tool. It is an important breakthrough in the field of computer-generated music, challenging traditional notions of music creativity and sparking philosophical and ethical discussions about artistic creation and machine capabilities [59].
Rule-based music generation methods allow specific harmonic patterns to be defined algorithmically and are comprehensible to humans. They are therefore very useful for music that must strictly follow a specific music theory. In addition, they can quickly and efficiently generate a large amount of music content, saving time and labor costs. However, because different rules need to be established for different types of music, the quality of the generated music depends largely on the creativity of the developers and on how well abstract musical concepts can be expressed with appropriate variables. Designing a set of rules that produces satisfactory music requires deep knowledge of music theory and strong technical implementation skills. In some cases, the rules must also be continually analyzed and adjusted to avoid generating monotonous or repetitive music.

2.2. Markov Chain

The Markov model is a mathematical model in which the next state depends only on the current state, not on the preceding history. In music generation, the Markov chain algorithm is usually used to statistically analyze how notes change in a composer’s works and to build a probability (transition) matrix that records these patterns. By using this matrix to generate the next musical element one at a time, music can be generated automatically.
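The sketch below shows this idea at its simplest: a first-order transition table is estimated from a toy pitch sequence and then sampled note by note. In practice, the table would be estimated from a full corpus rather than the illustrative melody assumed here.

import random
from collections import defaultdict

corpus = [60, 62, 64, 62, 60, 64, 65, 67, 65, 64, 62, 60]   # toy training melody

# Count pitch-to-pitch transitions (the rows of the transition matrix)
transitions = defaultdict(lambda: defaultdict(int))
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current][nxt] += 1

def next_note(current):
    candidates = transitions[current]
    notes, weights = list(candidates), list(candidates.values())
    return random.choices(notes, weights=weights)[0]

# Generate 16 notes starting from middle C (60)
note, generated = 60, [60]
for _ in range(15):
    note = next_note(note)
    generated.append(note)
print(generated)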
In the aforementioned Illiac Suite, the random notes of the fourth movement were generated with a Markov chain model that applied harmony and polyphony rules to select admissible notes, thereby realizing a string quartet [60,61]. In 2002, François Pachet proposed a music generation system named Continuator based on hierarchical Markov models, which can generate music notes by learning from arbitrary styles in real time [18]. Later, in 2011, François Pachet developed an interactive music generation system based on Markov chains, which can interactively generate blues-style music and Al Di Meola-style improvisations based on user input [19]. In 2016, Alexandre Papadopoulos et al. proposed a music generation application named FlowComposer that enables a user to compose musical themes with only partial information, while the remaining parts are generated automatically in the style of a chosen composer [20]. Two Markov chains with regular constraints are combined in FlowComposer to generate the melody and the related chord sequence. In 2016, Chih-Fang Huang et al. used Markov chains to analyze the melodies of Chinese music, established a Markov transition table for the Chinese pentatonic scale, and generated 30 music samples combining various modes and rhythms [21].
As the Markov chain algorithm only considers the local note transition probability of the music, but does not take into account the overall structure of the music, the generated music often lacks long-term structure. To solve this problem, hidden Markov models (HMMs) were introduced [62,63]. HMMs can handle longer dependencies by introducing probabilistic relationships between hidden states (melody and rhythm of music) and observable states (instrument performances, sounds, and pitches), allowing the generation of more complex and expressive music [64]. In 2008, Microsoft developed an interactive automatic chord accompaniment system called MySong. It uses an HMM model trained on a music database, allowing even users with no music experience to quickly generate satisfactory song accompaniments by simply singing into a microphone in ten minutes [22]. In 2019, Duncan Williams et al. developed a music generation system using HMMs to generate a series of emotionally rich music clips based on the participant’s galvanic skin response [23]. The experiments indicated that emotionally relevant music can be composed by HMMs.
The advantage of Markov models is that they are easy to control and allow constraints to be added to the internal structure to be adapted to different music styles. However, Markov models can only capture the statistical information of music data based on the order of the model, and it is difficult to capture structures on a longer time scale. The music created by humans is meaningful at multiple structural levels such as melody, harmony, rhythm, and timbre, which may not be captured by low-order Markov models. Therefore, from a creative perspective, for longer music works, the Markov models may excessively reuse the melodies in the corpus, causing the music clips sometimes to appear monotonous and boring.

2.3. Evolutionary Computation

Evolutionary computation originated in the late 1950s and developed significantly in the 1990s, driven by advances in powerful computing platforms [65]. It comprises algorithms that solve complex search and optimization problems by simulating biological evolution mechanisms such as selection, inheritance, and mutation. These algorithms iteratively search over candidate solutions (the population) and use fitness functions to evaluate their quality, generating new populations through selection, crossover (or recombination), and mutation and gradually approaching an optimal solution [66]. In music generation, an evolutionary algorithm requires a fitness function to assess the quality of each music clip. It begins by randomly generating music clips, evaluates them with the fitness function, selects the fittest individuals, and applies crossover and mutation to produce new clips.
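The following minimal sketch follows this loop with a hand-written fitness function that rewards C-major scale tones and penalizes large melodic leaps; the fitness design, melody length, and operators are illustrative assumptions, since real systems such as GenJam or EvoComposer use far richer musical criteria.

import random

SCALE = {0, 2, 4, 5, 7, 9, 11}                 # pitch classes of C major
LENGTH, POP_SIZE, GENERATIONS = 8, 30, 200

def random_melody():
    return [random.randint(55, 79) for _ in range(LENGTH)]

def fitness(melody):
    in_scale = sum(1 for p in melody if p % 12 in SCALE)   # reward scale tones
    leaps = sum(abs(a - b) for a, b in zip(melody, melody[1:]))
    return in_scale - 0.1 * leaps                          # penalize large leaps

def crossover(a, b):
    cut = random.randint(1, LENGTH - 1)
    return a[:cut] + b[cut:]

def mutate(melody, rate=0.1):
    return [p + random.choice([-2, -1, 1, 2]) if random.random() < rate else p
            for p in melody]

population = [random_melody() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)              # evaluate and rank
    parents = population[:POP_SIZE // 2]                    # selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]    # crossover + mutation
    population = parents + children

print(max(population, key=fitness))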
In 1991, Horner Andrew et al. first used evolutionary algorithms with the technique of thematic bridging in the field of music generation [24]. In 1993, John Al Biles used genetic algorithms to design an improvisational accompaniment system called “GenJam”. After learning 4 or 8 bars of the performer’s performance, it can automatically generate new melodies of similar style and also has the function of playing together with the performer [25]. In 2001, Palle Dahlstedt et al. used genetic algorithms to create an automatic piano playing system called MutaSynth that can generate classical music [26]. Later, he also developed the enhanced system named Ossia and composed a new piano piece every three minutes at the Gaudeamus Music Week 2002 in Amsterdam [67]. In 2016, Marco Scirea et al. developed an evolutionary algorithm-based improvisation music generation system called MetaCompose [27]. The system can not only automatically generate chords, melodies and accompaniment but also create music in different emotional states in real time. In 2020, Roberto De Prisco et al. used evolutionary algorithms to develop an automatic composition synthesizer called EvoComposer, which successfully solved the problem of inharmonious four-part harmonies (bass, tenor, alto and soprano) in previous automatic music generation [28]. EvoComposer achieved Bach-style four-part harmonies by conducting an in-depth analysis of Bach’s music and selecting a fitness function. Users only need to select one of the parts, and EvoComposer will automatically generate the other three parts, bringing a harmonious four-part harmony effect to music creation.
Evolutionary computation is an effective method that can reduce composers’ burden in the early stages of creation and quickly produce innovative and diverse melodies. By randomly mutating and combining music clips, it not only provides a fast and effective way to create music but also lets users interactively adjust the style of the generated music to personal preferences. However, because musical aesthetics are subjective, it is very challenging to define a fitness function that evaluates music clips reasonably. In addition, evolving high-quality music works may require considerable computing resources and time, which increases the computational cost of the approach. In recent years, evolutionary computation has been combined with deep learning for music generation [3].

2.4. Deep Learning

Deep learning is an algorithm based on neural networks that simulates the processing methods of the human brain. It collects a large amount of music data for standardized encoding, designs a neural network architecture tailored to the characteristics and requirements of the music data, and repeatedly trains the network until the output error is sufficiently small. The trained model is then used to generate music. Deep learning models used for music generation include recurrent neural networks, Generative Adversarial Networks, variational autoencoders, Transformer models and diffusion models.

2.4.1. Recurrent Neural Network

A recurrent neural network (RNN) is a sequential memory architecture that learns musical structure, chords, and melodies. It can memorize previous notes and predict the possibility of the next note based on this information. By repeating this process, it can generate coherent musical pieces [68].
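A minimal sketch of this next-note idea is given below, assuming PyTorch and a vocabulary of 128 MIDI pitches; it illustrates the general RNN-LSTM approach discussed in this subsection, not a reproduction of any of the cited systems.

import torch
import torch.nn as nn

class MelodyLSTM(nn.Module):
    def __init__(self, vocab=128, embed=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, notes):                  # notes: (batch, seq_len) pitch indices
        h, _ = self.lstm(self.embed(notes))
        return self.head(h)                    # next-pitch logits at every step

model = MelodyLSTM()
seq = torch.randint(0, 128, (4, 32))           # a batch of toy pitch sequences
logits = model(seq)
loss = nn.functional.cross_entropy(            # predict note t+1 from notes up to t
    logits[:, :-1].reshape(-1, 128), seq[:, 1:].reshape(-1))
loss.backward()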
In 1989, Peter Todd first used RNNs to generate simple melodies [29]. After training on selected melodies, the networks could generate new melodies similar to the training material. In 1994, Michael Mozer et al. proposed an RNN-based music generation model named CONCERT. After training on Bach’s works, this model can not only generate monophonic melodies with chords but also compose note by note [30]. However, as the length of pieces and the amount of higher-order structure increase, the performance of CONCERT is limited by its architecture and training procedure. The vanishing-gradient problem makes it difficult for RNNs to process long sequences of music data [69]. To address this, Sepp Hochreiter et al. proposed the long short-term memory recurrent network (RNN-LSTM) in 1997 [70]. By introducing a gating mechanism that mitigates gradient vanishing, the model can adjust its memory behavior to suit different types of sequence data and tasks, such as polyphonic music. In 2002, Douglas Eck proposed an RNN-LSTM-based composition model that can generate 12-bar blues-style improvisations from a short recording [31]. In 2016, Google Brain introduced the Melody RNN model, which greatly improved the ability of RNNs to learn long-sequence musical structures [32]. Melody RNN can generate various types of music, including classical and pop. In 2016, Sony CSL developed the DeepBach platform at the Paris Computer Science Laboratory using the RNN-LSTM model. They trained it on 352 Bach works and created 2503 four-part pieces that reproduce Bach’s style well. To test DeepBach’s performance, the team presented these works to 1272 participants with professional knowledge of music, and more than 50% of them believed the works were composed by Bach himself [33], indicating that DeepBach passed a Turing-style test of imitating Bach’s compositional style. In 2019, Zhu Jun et al. proposed a hierarchical recurrent neural network (HRNN) model to improve the structure of long-duration music [34]. This model uses a hierarchical generation method to progressively form measures, beats, and notes. Experiments have shown that the melodies generated by the HRNN model are better than those of other RNN models.
RNNs have a unique ability to process and learn time series data, can capture the temporal characteristics of music, and can create rhythmic and coherent music. However, due to the over-fitting phenomenon that may exist in RNN music generation, the generated music might lack diversity and innovation [71]. In addition, the structure of RNNs will limit the controllability of music creation, making it difficult to accurately control the characteristics and style of music. Finally, RNNs’ ability to interpret music is relatively insufficient, and it is currently difficult to provide a detailed explanation and analysis of the generated music.

2.4.2. Generative Adversarial Network

Generative Adversarial Networks (GANs) generate music by letting two neural networks, a generator and a discriminator, compete with each other. The generator is responsible for producing music, while the discriminator is responsible for distinguishing generated music from real music. Through continuous adversarial learning and adjustment, the generator gradually improves the quality of the generated music until it approaches the level of real music. The GAN was proposed by Ian Goodfellow in 2014 and has achieved many breakthrough results in computer vision, natural language processing, and other fields [72].
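The sketch below illustrates one adversarial update over flattened piano-roll bars, assuming PyTorch and toy shapes; it is a generic GAN training step, not the architecture of MidiNet, MuseGAN, or the other systems described next.

import torch
import torch.nn as nn

BARS = 16 * 128                                 # one flattened bar: 16 steps x 128 pitches

G = nn.Sequential(nn.Linear(100, 512), nn.ReLU(), nn.Linear(512, BARS), nn.Sigmoid())
D = nn.Sequential(nn.Linear(BARS, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = (torch.rand(8, BARS) > 0.9).float()      # stand-in for a batch of real piano rolls

# Discriminator step: real bars labeled 1, generated bars labeled 0
fake = G(torch.randn(8, 100))
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator call its output real
g_loss = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()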
In 2017, Li-Chia Yang et al. developed a GAN-based music generator named MidiNet. Rather than processing audio waveforms directly, MidiNet generates music in the symbolic domain, making the generated music clearer in structure and significantly improving its quality. Experimental results showed that, compared with Google’s Melody RNN, the melodies generated by MidiNet are more interesting [35]. In 2018, Dong Haowen et al. developed the music generation system MuseGAN, which can not only create complex musical works with multiple tracks for different instruments such as piano, drums, and guitar, but also generate accompaniment that conforms to the rules of harmony based on a given melody line [36]. In 2018, Nicholas Trieu et al. developed a jazz generative model called JazzGAN. This model has a deep understanding of jazz, effectively showcasing harmony, melody, rhythm, and structure, and it is capable of improvisation [37]. In 2018, Chris Donahue et al. developed WaveGAN, the first to apply GANs to the unsupervised synthesis of raw waveform audio signals. Combined with parallel computing technology, it can generate hours of audio in less than 2 s [38]. In 2019, Google’s Magenta team developed the audio synthesis model GANSynth, which can generate complete audio clips in one pass. Compared with the earlier WaveGAN, the clips generated by GANSynth are better. It excels at generating realistic instrument sounds, especially musical details that are difficult to model physically or capture by sampling technology [39].
GANs can generate high-quality music quickly and can produce works in distinctive styles by learning from large music datasets. However, GANs also have limitations in music generation. They require substantial computing resources and time to train. Furthermore, there are currently no sufficiently reliable research results explaining how to effectively use GANs to encode text data or music scores. Although GANs have achieved great success in image generation, their application to music still faces challenges.

2.4.3. Variational Autoencoder

The variational autoencoder (VAE) is a neural network structure proposed by Diederik P. Kingma and Max Welling in 2013 [73]. It is a generative model composed of an encoder and a decoder. A VAE generates new music by learning the distribution characteristics of the input data: the encoder transforms the melody, rhythm, harmony, timbre, style, and other characteristics of the input music into a latent vector, a highly abstract mathematical representation that captures the underlying attributes of the data, while the decoder reconstructs this vector back into music. The VAE exploits the variability of the latent vector to generate a variety of musical works.
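A minimal sketch of this encoder–latent–decoder scheme over flattened piano-roll bars is given below, assuming PyTorch; the layer sizes and the sampling of new bars from random latent vectors are illustrative and much simpler than the hierarchical decoders of models such as MusicVAE.

import torch
import torch.nn as nn

INPUT, LATENT = 16 * 128, 32                    # one bar = 16 steps x 128 pitches

class BarVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(INPUT, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, LATENT), nn.Linear(256, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                 nn.Linear(256, INPUT), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(z), mu, logvar

vae = BarVAE()
x = (torch.rand(8, INPUT) > 0.9).float()        # stand-in batch of real bars
recon, mu, logvar = vae(x)
recon_loss = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
(recon_loss + kl).backward()

# New music: decode random points in the latent space
samples = vae.dec(torch.randn(4, LATENT))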
In 2018, Gino Brunner et al. introduced a VAE-based dynamic polyphonic music generation system called MIDI-VAE [40]. By altering pitches, dynamics, and instruments, MIDI-VAE enables seamless transitions between genres, such as from classical to jazz. In 2018, Google Brain designed the MusicVAE model for music generation. The model combines a hierarchical decoder with a long-term structured music sequence model to accurately capture the structure of long-duration polyphonic music; quantitative and qualitative experiments show that MusicVAE significantly outperforms traditional RNN models [41]. The model was later enhanced by Ian Simon et al., who introduced chord conditioning, making it the first model capable of generating multi-track sequential music in latent space [74]. In 2019, Jing Luo et al. developed a music generation model called MG-VAE, applying VAE technology to Chinese music creation for the first time. By analyzing more than 2000 folk songs in a Chinese folk song database, the model can create folk songs with the characteristics of different regions of China [42]. In 2020, Ziyu Wang et al. proposed a piano polyphonic music generation model called PIANOTREE VAE. The model integrates the structure of music data into its architecture, while also incorporating sparsity and hierarchical priors [43]. Trained on 5000 classical and popular piano pieces, it successfully generated melodious piano music. Compared with traditional VAE models, PIANOTREE VAE achieves better results in data reconstruction, interpolation, data generation, and model interpretability.
VAEs can generate diverse and complex musical pieces by learning and modeling the latent space. They allow users to modify the style and emotional quality of music by adjusting latent variables, producing compositions that are both expressive and tailored to individual preferences. Additionally, VAEs can quickly generate a high volume of quality music samples owing to their efficient training process. However, because VAEs operate at a higher, more abstract feature level, the generated music may not sound as good as music generated directly from the audio waveform. In addition, when generating new data, VAEs are prone to over-smoothing, which may result in insufficient dynamic contrast in the music. Finally, training a VAE requires designers to carefully tune parameters and training strategies.

2.4.4. Transformer Architecture

Transformer is a deep learning model introduced by Google in 2017 [75]. Unlike traditional recurrent architectures that process data sequentially, Transformers use a self-attention mechanism to compute attention weights over each input sequence, allowing them to capture global dependencies across the entire sequence. Transformers consist of stacked encoder and decoder layers containing multiple sub-modules. Each encoder layer contains a feed-forward neural network and a multi-head self-attention mechanism, enabling it to extract context-aware representations through self-attention. Each decoder layer contains a feed-forward neural network, multi-head self-attention, and masked multi-head self-attention mechanisms, enabling it to pick up the relevant information produced by the encoder. This architecture makes Transformers fast to train and able to weigh the relative importance of different parts of the input. Originally designed for natural language tasks, they have been widely used in machine translation, text generation, question-answering systems, and other fields. In 2020 and 2023, OpenAI released the generative pre-trained Transformer models GPT-3 and GPT-4, which generate text content from users’ input requests.
In music generation, Transformer models convert the input music data into a sequence of tokens encoding pitch, duration, musical style, and other features. Through the cooperation of the encoder and decoder, a Transformer can predict the next note in the sequence and generate music with rich layers and rhythm. Compared with other music generation models, Transformers can not only generate new notes using contextual information and global features, but also be continuously optimized through further training, steadily improving the quality of the generated music. These features give Transformer models an important role in the field of music generation.
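The sketch below shows the core of this approach: a small decoder-only Transformer (a Transformer encoder stack with a causal mask) trained to predict the next music-event token. The vocabulary size, model dimensions, and absolute position embeddings are illustrative assumptions and differ from Music Transformer’s relative attention.

import torch
import torch.nn as nn

VOCAB, D_MODEL, SEQ = 512, 256, 128

class MusicDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(SEQ, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):                              # tokens: (batch, seq)
        n = tokens.size(1)
        t = torch.arange(n, device=tokens.device)
        x = self.tok(tokens) + self.pos(t)
        # Causal mask: each position may only attend to earlier positions
        mask = torch.triu(torch.full((n, n), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)                                 # next-token logits

model = MusicDecoder()
tokens = torch.randint(0, VOCAB, (2, SEQ))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                                   tokens[:, 1:].reshape(-1))
loss.backward()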
In 2018, Google developed a music generation model called Music Transformer, which uses Bach’s four-part chorale scores and piano performance data to generate 1 min music pieces with 10-millisecond resolution; users can also generate personalized accompaniment through text input [44]. Music Transformer employs relative position representations so that attention can be informed by the distance between positions in a sequence, and its memory requirement is reduced by a “skewing” procedure, allowing longer music sequences to be generated [44]. In 2019, OpenAI developed MuseNet, based on the GPT-2 model, for music generation. MuseNet can generate 4 min of music with up to 10 different instruments in a variety of styles, and can also generate harmony and accompaniment in different styles based on a theme input by the user. MuseNet offers “simple” and “advanced” modes for exploration and interaction: the simple mode gives users a ready-made set of samples for trying out different composers and styles, helping them explore the variety of music MuseNet can generate, while the advanced mode lets users interact more directly with MuseNet to make completely new pieces.
Traditional Transformers were mainly used for text generation. Because music patterns differ markedly from text, these early studies trained their models with the same strategies used for text generation tasks and were therefore unable to consistently produce long, high-quality musical works. In 2020, Zhang Ning et al. proposed a Transformer model combined with a GAN. This model combines adversarial learning with a self-attention architecture to generate long note sequences guided by adversarial objectives, while the Transformer learns global and local musical structures. Compared with the original music Transformers, this model can generate higher-quality long-duration musical works [76]. In 2020, Xia Guangyu et al. proposed Transformer VAE, which integrates the Transformer architecture with a VAE to enhance structural awareness and interpretability in music generation: the Transformer learns long-term dependencies through self-attention, while the VAE learns interpretable latent representations. Transformer VAE can therefore learn long-term signals with a global memory structure as well as local contextual content [77]. In 2024, Wang Weining et al. proposed a style-adjustable music generation model, Transformer–GAN, which introduces a style-constraint generator and a style discriminator for style conditioning and adversarial learning, respectively. This model can generate a complete musical work in the target style specified by the user [78].
In recent years, commercial companies have developed Transformer-based music generation models. In 2020, OpenAI released the automatic music generation model Jukebox, which encodes music with VQ-VAE (vector-quantized variational autoencoder) technology and generates music with a Transformer-based model trained on these codes [45]. Jukebox can generate music pieces lasting several minutes, with singing that sounds natural and is easy to recognize. In 2023, Google proposed a music generation model called MusicLM, which can generate high-fidelity music from text [46]. The model has been shown to generate pieces ranging from 30 s to 5 min at 24 kHz, and it can produce music from a series of text prompts expressing shifts in mood or storyline, highlighting its ability to transition smoothly between different emotional levels. In the same year, Meta released a music generation model called MusicGen with a Transformer architecture [47]. MusicGen differs from MusicLM in that it does not need a self-supervised semantic representation and generates all four codebooks at the same time; by adding a slight delay between the codebooks, it becomes possible to predict them in parallel. MusicGen is an advanced single-stage model for music generation that can be controlled by both text and melody inputs.
Transformers excel at capturing long-range dependencies and intricate musical structures. These abilities make them particularly effective for generating music that features complex arrangements and extended sequences. Their architecture allows for the creation of coherent compositions, maintaining thematic consistency throughout. Transformers utilize self-attention mechanisms, enabling them to analyze and synthesize various musical elements, resulting in rich and diverse outputs. However, Transformers require significant computational resources, which can make them less accessible for smaller projects or organizations. Additionally, training these models necessitates large amounts of high-quality data to achieve optimal performance.

2.4.5. Diffusion Model

A diffusion model is a deep learning model based on a probabilistic generation framework, whose core idea derives from the stochastic processes of non-equilibrium thermodynamics [79]. The model generates data by defining a forward diffusion process and a reverse denoising process. In the forward process, Gaussian noise is gradually added to the input data, destroying the structure of the data distribution. In the reverse process, deep neural networks are trained to optimize the upper and lower bounds on the entropy of each step in the reverse trajectory, learning to gradually reconstruct the original data distribution from the noise. Compared with GANs and VAEs, diffusion models show higher fidelity and training stability in image and audio synthesis tasks, and their iterative generation mechanism can accurately model data details.
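A minimal sketch of the forward noising step and the standard noise-prediction training objective (in the DDPM style) is shown below, assuming PyTorch; the linear noise schedule, the toy frame size, and the tiny denoiser network are illustrative assumptions, far simpler than the cascaded models used by systems such as Noise2Music or Moûsai.

import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative signal retention

denoiser = nn.Sequential(nn.Linear(256 + 1, 512), nn.ReLU(), nn.Linear(512, 256))

x0 = torch.randn(16, 256)                             # stand-in batch of clean frames
t = torch.randint(0, T, (16,))
noise = torch.randn_like(x0)

# Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise
a_bar = alphas_bar[t].unsqueeze(1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

# Reverse-process training: the network learns to predict the noise added at step t
t_embed = (t.float() / T).unsqueeze(1)                # crude timestep conditioning
pred_noise = denoiser(torch.cat([x_t, t_embed], dim=1))
loss = nn.functional.mse_loss(pred_noise, noise)
loss.backward()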
In 2022, the Riffusion model was proposed to generate real-time music audio clips from text prompts and spectrogram images by fine-tuning the Stable Diffusion model [48]. Riffusion provides a web-based user interface that allows users to select a music style or enter text keywords without installing software. In 2023, Qingqing Huang et al. introduced Noise2Music, which uses two types of diffusion models to generate 30 s high-quality music clips from text prompts [49]. Noise2Music exhibits strong generative capabilities that go beyond simple label conditioning: by exploiting the semantic content of the provided captions, it generates complex and detailed music based on attributes such as genre, instrument, tempo, mood, vocals, and era, and it can also create music from free-form creative prompts, making it highly versatile. In 2023, Flavio Schneider et al. developed Moûsai, a cascading two-stage latent diffusion model that generates several minutes of high-quality 48 kHz stereo music from input text [50]. Moûsai was trained on a diverse dataset comprising 2500 h of stereo music sampled at 48 kHz, using metadata such as title, author, album, genre, and release year as textual descriptions. In 2025, Ziqian Ning et al. proposed DiffRhythm, the first latent-diffusion-based song generation model that can synthesize complete songs, featuring both vocals and accompaniment, with durations of up to 4 min 45 s in just 10 s of generation time, while ensuring high musical quality and clarity [51]. DiffRhythm consists of two sequentially trained models: a VAE that learns a compact latent representation of audio waveforms, enabling the modeling of audio pieces lasting several minutes, and a Diffusion Transformer that operates in the VAE’s latent space, using an iterative denoising process to generate complete songs. This combination allows for effective song generation while maintaining high audio quality and structure, paving the way for new approaches in audio synthesis and music production.
Owing to the excellent performance of diffusion models, commercial companies have also launched related music generation platforms. In 2024, Suno released Suno V3, a music generation model based on Transformer and diffusion models. Users simply input lyrics and a music style, and the model quickly generates a complete 2-min song, including vocals and instrumental accompaniment [52]. Suno V3 also supports lyrics in multiple languages, such as English, Chinese, Japanese, and Russian, and can even create songs in Cantonese. In 2024, Kunlun Tech launched SkyMusic, China’s first AI music generation model, achieving state-of-the-art performance among large music models. SkyMusic applies a Sora-like architecture to music audio; both Suno V3 and SkyMusic adopt two-stage, large-scale Diffusion Transformer generation methods [15]. In the same year, Stability AI released the music generation tool Stable Audio 2.0, which uses an improved Transformer and can process longer audio data. Users can generate 3-min stereo music of different styles with a complete structure by describing the desired musical features in text, with sound quality of up to 44.1 kHz [15]. Compared with Suno V3, Stable Audio 2.0 can generate longer music, but it cannot automatically generate lyrics.
Diffusion models are good at generating high-quality audio, producing high-fidelity music effectively. They are particularly well-suited for creating detailed sound effects and music in media production. Their ability to capture complex audio characteristics makes them ideal for professional applications where sound quality is paramount. However, their training and generation times can be lengthy, which poses difficulties in real-time scenarios. This limitation can hinder their use in applications requiring immediate audio feedback, such as live performances or interactive media. Additionally, the computational resources required for training these models can be substantial.

2.5. Summary of Music Generation Models

In the past, music generation primarily relied on manually established rules and patterns, resulting in melodies that were often mechanical and monotonous. With the advancement of deep learning technology, neural network-based music generation models have gradually emerged, making music melodies richer and more vibrant. A comparison of these music generation models is shown in Table 1. Rule-based models are easy to control and understand, but they lack creativity and flexibility. Markov chain models are simple and effective, but they are not suitable for long-duration music. Evolutionary computation models adapt well to feedback, but their generation time is long and their quality relies on the fitness function. RNN models can capture the temporal characteristics of music, but their quality is low for long-duration music. GAN models can produce high-quality music samples, but they are unstable during training and require large datasets. VAE models can generate diverse music, but the generated samples may lack fidelity and dynamic detail. Transformer models can generate long-duration music with complex arrangements, but they are less efficient when processing long sequences. Diffusion models can generate high-quality music samples, but they have higher model complexity and consume significant resources.
In recent years, Transformer and diffusion models have shown obvious advantages in music generation and have been applied in multiple commercial platforms. Transformer models excel in sequence modeling and parallel computing, and are suitable for generating structured and coherent musical works. Diffusion models, on the other hand, excel in generating high-quality, diverse, realistic and creative music. The choice of which model to use depends mainly on the specific application scenario and requirements. In the future, Transformer models and diffusion models will be further integrated, combining each other’s advantages and overcoming their respective shortcomings.

3. Applications of Music AI

With the continued development of artificial intelligence technology, the application fields of music AI have become increasingly diverse. This section examines the current state and limitations of commercial applications of AI in music generation and composition, music rehabilitation and therapy, and music education and learning.

3.1. Music Generation and Composition

At present, many AI composition platforms have realized the function of music composition. In 2015, Jukedeck launched Jukedeck MAKE, an online music creation platform based on deep learning technology. It can generate a few minutes of music based on parameters input by the user, such as music type, mood, instrument, and rhythm. In 2016, Sony’s Paris Computer Science Laboratory used the FlowMachines generation model to create “Daddy’s Car”, a song in the style of the Beatles, and “The Ballad of Mr. Shadow”, in the style of American songwriters [80]. In the same year, Google launched the deep-learning-based Magenta platform, which can convert and arrange musical styles and generate accompaniments for multiple instruments, such as guitar, violin, and piano. For the commemoration of Bach’s birthday in March 2019, Google used the Magenta platform to develop the “Bach Doodle” music creation tool, with which users can draw shapes and patterns with a mouse or stylus to generate Bach-style music that matches the drawing [81].
AIVA Technology is focused on developing a platform for creating classical and symphonic music, and has become the world’s first virtual composer certified by the French Society of Authors, Composers and Publishers of Music (SACEM) [82]. AIVA has released “Genesis”, “Among the Stars”, and the Chinese-themed album “Ai Wa”. In October 2020, Korean music newcomer Xia Yan debuted with the song “Eyes on You,” which was created using the music artificial intelligence model EvoM, making her the first human singer in the world to launch a song composed by AI. In April 2019, OpenAI released the music artificial intelligence system MuseNet, which can generate a 4-min musical work based on the style, instrument, harmony and other elements input by the user. MuseNet can understand and generate music for 10 different instruments, including 15 different styles such as country, Mozart, and the Beatles. It can also generate mixed-style music and accompaniment works, and even play Lady Gaga’s songs in the style of Mozart. In 2023, the music generation platform MusicFX launched by Google can synthesize music based on the music style, music elements, instruments and scenes input by the user. In November 2023, the Beatles released their final single, “Now and Then.” This song utilized music artificial intelligence technology to extract John Lennon’s vocal track from a tape recorded before his death in 1978. It successfully combined these vocals with guitar accompaniment created by George Harrison in 1995, effectively reviving the song from 45 years ago and showcasing the immense potential of artificial intelligence in music synthesis [83].
In June 2023, Meta released the music generation model MusicGen, which was trained on 20,000 h of authored music data, covering a wide range of styles and genres. MusicGen can generate 30-s stereo music at 32 kHz according to texts and also allows for conditional music generation guided by melodies uploaded by users. MusicGen is capable of producing music for films, video games, and other multimedia applications. In March 2024, Suno launched Suno V3, a song generator described as the “ChatGPT of the music world”. Unlike earlier music generators, this tool allows users to create songs in multiple languages by simply entering basic prompts, generating not only lyrics but also vocals and soundtracks. In November 2024, Suno V4 was released, allowing music generation of up to 4 min in length. Compared to earlier versions, it features a significant improvement in audio quality, supports more complex and diverse song structures, and enhances the emotional expressiveness of the music. A lyric assistant was introduced to improve the creativity. In May 2025, Suno V5 was launched, further extending the music generation duration to 8 min. In April 2024, Stability AI introduced Stable Audio 2.0, which enhanced both the length and quality of generated music, enabling the production of 3 min of stereo-quality music at 44.1 kHz. The music generated can be customized by users, allowing for variations and sound effect creation. Its earlier version, Stable Audio 1.0, was recognized by TIME magazine as one of the best inventions of 2023. In April 2024, Udio was launched, enabling users to create music and generate lyrics based on simple text prompts by specifying themes, genres, and other descriptors. Initially, the generated songs were 30 s long, but this was later updated to allow for a duration of up to 15 min. Udio supports multiple languages, allowing users to produce a diverse range of musical works, including Chinese pop, Japanese pop, Russian pop, and Latin rhythms.
Although music AI has made significant strides in music generation and composition, it still faces limitations. First, music AI’s ability to express sensibility and emotion needs to be improved. Machines generate new works by learning from and imitating existing music, making it challenging to produce music that surpasses human imagination. Second, music AI lacks true creativity and inspiration. It cannot infuse rich emotions and personal experiences into the creative process like human musicians can.

3.2. Music Rehabilitation and Therapy

Music can relieve patients’ psychological problems such as anxiety, depression and stress, and improve their mood and quality of life. In the field of motor rehabilitation, research has shown that musical rhythms can influence movement in patients with neurological disorders. This discovery paves the way for using rhythm and music as a consistent time reference to stimulate the motor system. It can help reprogram and improve the execution of movement patterns in these patients [84]. It is possible to match body movements to external sounds with rhythm, like music or a metronome. This happens because the regular and predictable rhythm of the music is easily detected by human ears. This detection helps align brain activity in the areas responsible for listening to rhythms and making movements [85]. Music therapy can help patients with limited motor skills to restore language and motor functions, and can also be used to treat chronic diseases such as autism [86], Alzheimer’s disease [87] and cancer [88].
Hu Bin et al. developed Ambulosono, a gait rehabilitation training system for Parkinson’s patients. Through the music AI training model, the system establishes a positive feedback treatment method between walking and music, helping patients regain the ability to control their gait and stride [89]. Gait disturbances, postural instability, and an increased risk of falls are common symptoms in patients with mild to moderate dementia and Alzheimer’s disease. Wittwer Joanne et al. studied the feasibility of home Rhythmic Auditory Stimulation (RAS) gait training to improve movement-related deficits in early Alzheimer’s disease. After the intervention, evaluations showed a significant increase in gait speed and an improvement in stride length [90]. Gonzalez-Hoelling et al. investigated the effect of RAS on gait training in patients following stroke. It indicated that patients receiving combined RAS training made greater improvements. They showed better functional ambulation, enhanced walking ability, and increased independent walking compared to participants in the conventional rehabilitation program [91]. Music-supported therapy is among the most extensively researched music-based interventions for addressing upper limb hemiparesis following a stroke [92]. The treatment technique is founded on the idea that playing a musical instrument is a pleasurable activity. It involves complex, coordinated movements that necessitate auditory–motor coupling and the integration of real-time multisensory information. Combining active music rehabilitation therapy with conventional therapy can effectively enhance upper limb movement parameters, including speed, accuracy, and smoothness. This integrated rehabilitation program can significantly improve the functional movement of the paralyzed upper limb following subacute and chronic stroke [93]. Dogruoz Karatekin et al. reported significant improvements in functional abilities, grip strength, finger strength, and gross and fine motor skills after three months of therapeutic instrumental music performance piano intervention in nine adolescents with cerebral palsy. These results show the potential of this music-based therapy [94]. In addition, music therapy significantly contributes to enhancing reading skills and addressing phonological awareness issues in children with dyslexia [95].
As an auxiliary method of treatment, music therapy with AI also has some problems to be solved in its development. First, music AI lacks emotional communication and human care with people in treatment, which is crucial for patients with psychological and emotional problems. Secondly, everyone’s psychological and emotional problems are unique and complex, and the algorithms and models of music AI may not meet the needs and changes of different individuals. Therefore, in the process of music therapy, human music therapists and music AI should collaborate with each other to achieve better treatment results.

3.3. Music Education and Learning

Musical AI has also had a significant impact in the field of music education and learning. Studying music not only enriches learners’ understanding and appreciation of musical art but also positively contributes to brain development and enhances cognitive abilities. Music learning has been shown to help children develop motivation and concentration skills. These skills enable them to maintain focus for longer periods, which may ultimately support their reading development [96]. A long-term study indicates that individuals who undergo musical training maintain higher levels of cognitive ability even decades later. People with more experience with musical instruments may experience greater improvements in general cognitive abilities [97]. However, due to various restrictions, not everyone has the opportunity to receive music education. According to a report released by the 2022 National Arts Education Data Project (AEDP), although U.S. public schools have made some progress in offering music courses in recent years, more than 3.6 million students still do not receive music education [98]. In addition to financial factors, students may hesitate to learn music for various other reasons, including time constraints and the challenge of finding music that truly engages their interests. Furthermore, a study in 2021 revealed that approximately 50% of teenagers in the UK and Germany discontinued music lessons and other musical activities by the age of 17 [99]. This trend is primarily attributed to the conflicts between music education and other leisure or study commitments, as well as the scarcity of performance opportunities, which can result in a diminished sense of achievement in their learning and ultimately lead to a waning interest in pursuing music. These findings highlight the shortcomings of traditional music education and underscore the importance of offering personalized learning environments. It is essential to provide students with the opportunity to showcase their individual talents and abilities. When students find joy in playing musical instruments and receive encouragement, they are more likely to persist in their music studies and cultivate their own unique musical identities.
The music AI education learning platform can provide students with a more flexible and convenient way to learn music. Students can select learning methods and instruments based on their interests and practice at their own pace, moving beyond the limitations of traditional music education. Personalized learning plans allow students to focus on their studies without feeling pressured by others’ progress. Furthermore, students have the flexibility to choose when and where to study. There are several innovative open music education and learning platforms that cater to users’ needs for personalized and engaging musical experiences. One prominent example is Yousician, an AI learning platform developed by a Finnish company. Yousician offers access to tens of thousands of songs across various genres and provides a comprehensive learning experience by displaying music scores, fingerings, and offering built-in metronome and tuner functions. It supports learning for 17 different instruments, including guitar and piano. One of Yousician’s standout features is its ability to analyze students’ performance accuracy in real time, providing immediate feedback and evaluations. This capability allows the platform to create personalized learning plans tailored to each student’s progress and goals. According to statistics, over 20 million users utilize Yousician for music learning each month, indicating its popularity and effectiveness. In addition to Yousician, there are other similar music education platforms, such as SmartMusic and Tonestro. SmartMusic focuses on providing interactive practice and assessment tools for students, allowing them to play along with accompaniments and receive feedback on their performance. Tonestro offers a gamified approach to learning, encouraging users to improve their musical skills through engaging exercises and immediate feedback.
In the research area, Jin Wei et al. introduced a music education and teaching method with AI to enhance the functions of music education management and developed relevant performance models that effectively evaluate the implementation of music education. Compared with existing models, the proposed model improves students’ learning achievement rate, efficiency ratio, teaching performance analysis rate, etc. [100]. Qin Wu et al. developed an AI decision support system to enhance the efficiency and accuracy of music education management. Additionally, they also employed time series prediction models based on convolutional neural networks (CNNs) to forecast technological trends in music education, providing data support for educational policies and resource allocation [101].
In conclusion, the advantages of utilizing music AI in music education are promising. It not only enables learners to study independently and effectively but also analyzes students’ performance skills in real time and offers technical suggestions. Although music AI plays an important role in music education, it cannot yet replace human teachers. Music, as an art of expressing emotions and conveying culture, is far more complex than current AI technology can understand. Teachers help learners understand the emotions and stories behind music and teach them how to translate their inner emotions into notes and melodies. Moreover, music is an interactive art form, and AI cannot fully grasp the communication of human thoughts and behaviors. Teachers impart musical skills and creative abilities, and they also help students learn how to engage in group activities such as choruses, ensembles, and accompaniment.

3.4. Summary

Music AI has achieved significant progress in various fields of music. In music generation and composition, music AI allows more users to freely compose their favorite music. This not only lowers the barriers to music creation but also significantly enriches the diversity and quantity of musical works. In music rehabilitation and therapy, music AI can effectively help patients alleviate mental stress and anxiety, as well as aid in recovery from illness. In music education and learning, teachers can tailor their instruction to individual needs, and students can learn various instruments more easily, which promotes a higher level of musical literacy among the public. Although music AI still faces challenges in these fields, continuous technological advancements will lead to its broader application.

4. Discussion

AI-generated music has become the most prominent and influential application of music AI today, and its impact on the music industry is profound. It has not only changed the process of music creation but has also gradually altered the structure and economics of the music industry. At the same time, the accompanying problems and challenges are becoming increasingly apparent. This section explores the ethical issues of AI-generated music, its impact on the structure and economy of the music industry, its impact on musical creativity, and the limitations of music AI.

4.1. Impact on the Music Industry

The traditional process of music creation demands significant technical skill and creativity, often involving collaboration among musicians, producers, and composers. Music AI is transforming this process with advanced techniques that simplify and expedite music creation. A study suggests that record companies that properly employ AI technology in music creation can significantly improve their profitability [102]. This highlights the potential benefits of incorporating generative AI into the music industry. However, the emergence of AI-generated music also introduces a range of ethical and economic challenges that the industry must address.

4.1.1. Impact on Intellectual Property

One of the most pressing ethical issues concerns the authorship and ownership of music created by AI. Beyond authorship, AI’s capability to imitate a musician’s style has also sparked worries about potential plagiarism. Music AI platforms such as Suno and Udio can analyze vast amounts of music data and generate compositions that resemble the styles of various artists. This creates a risk of unintentional imitation: AI systems may produce works that are highly similar in style to those of existing artists without their permission or financial compensation. The blurring of the line between original and derivative works raises ethical questions about intellectual property protection and about how to prevent AI from imitating existing musical works. In June 2024, record companies including Universal, Sony, and Warner accused Udio and Suno of using large amounts of copyrighted music to train their AI models; the lawsuits are still ongoing.
The process of AI music generation can be described in terms of inputs and outputs. The input encompasses resources such as music databases and theoretical knowledge, which the music AI utilizes to produce its output. Copyright disputes concerning the input phase primarily involve the authors or owners of the original materials and the companies that employ AI to generate content. In April 2023, the AI-generated song “Heart on My Sleeve” was taken down because Universal Music Group claimed that it infringed on the copyrights of artists Drake and The Weeknd [103]. AIVA has chosen to focus on classical music to circumvent copyright issues: many classical works are in the public domain, which allows their use with lower legal risk. To remind users to pay attention to these issues, Suno highlights the distinction between ownership and copyright. The rights to use music generated by free users are held by Suno, whereas the rights to music created by subscribed users belong to those users; however, the material produced by either type of user may not qualify for copyright protection [104].
The copyright issues related to AI-generated music at the output stage can be attributed to various factors. Firstly, the slow pace of legal adaptation is a significant contributor to copyright issues. Secondly, the question of creative authorship is vital, as traditional definitions typically involve human creators, whereas AI-generated music introduces algorithms and machines as contributors, potentially resulting in conflicts over copyright ownership. Finally, differing interpretations and understandings of AI music in various regions make it difficult to establish which laws apply in cross-border collaborations. For example, Suno states that copyright issues are complicated and differ from one region to another [104].
In order to promote and regulate the development of AI-generated content, some countries have implemented copyright exceptions for text and data mining. U.S. copyright law evaluates fair use based on purpose, nature, amount, and market impact. It recognizes transformative use, which can allow new uses without being considered infringement, even if part of the original work is copied [105]. The Directive on Copyright in the Digital Single Market, approved by the European Parliament in 2019, establishes clear guidelines for the fair use of text and data mining, allowing reasonable use of AI training materials if obtained legally and without explicit prohibition from the rights holder. Copyright laws of different countries generally consider originality and human intellectual achievement as the fundamental basis for protection. In February 2023, the U.S. Copyright Office revoked the copyright of the comic book “Zarya of the Dawn” because its images were generated by the AI image generation tool Midjourney without human creative contribution to the image creation process [106]. In April 2025, the Zhangjiagang City Court in China ruled on a copyright dispute involving an AI-generated child seat design, concluding that content primarily created by the AI drawing software Midjourney should not be classified as a work. These cases show that, in both the United States and China, human participation is a key criterion for determining whether AI-generated products can obtain copyright protection. In summary, the copyright issues surrounding AI-generated music require comprehensive consideration of factors such as creativity, originality, and legality. It is necessary to improve copyright laws for AI-generated music as soon as possible to balance the creative freedom of AI music with the protection of artists’ rights and interests.

4.1.2. Impact on Economies

The economics of the music industry are changing with the increase in AI-generated music on streaming platforms. The traditional distribution of profit is gradually shifting, which directly affects artists, record companies, streaming platforms, and listeners. Because platforms such as Spotify, Apple Music, and Amazon Music primarily distribute royalties to artists based on streaming popularity, the rise of AI-generated music on these platforms shifts profit away from artists toward the entities creating AI music. For streaming platforms, using AI-generated tracks is more economical than licensing original music created by humans. However, when AI generates music by drawing on a vast array of existing works, the ambiguity in current copyright law affects the economic dynamics of the music creation industry. Under the current situation, creators may suffer losses in two ways. Firstly, artists lose income when their works are used without authorization to train AI models. Secondly, as AI-generated music becomes increasingly available on streaming platforms, it reduces the revenue that artists and record companies receive from those platforms. If streaming platforms lack sufficient transparency, listeners may unknowingly consume music that is entirely machine-generated, which devalues genuine artistic creations. As mentioned before, copyright laws on AI-generated music are still not fully developed. Nevertheless, the trend of using AI to create music appears unstoppable, and record companies no longer view AI-generated music purely as a threat. Instead, they are collaborating with streaming platforms to explore new business models, aiming to achieve greater revenue from music copyrights. It was reported in June 2025 that Universal, Warner, and Sony Music were negotiating AI music rights with two startup companies [107]. YouTube has engaged in discussions with major record companies to obtain licenses for music in AI training, while Meta has recently broadened its collaboration with Universal Music Group.
To address the rapid advancement of AI-generated music, it is crucial to re-evaluate copyright laws and royalty structures. Traditional royalty distribution models rely on the concept of human authorship, but with the rising popularity of AI-generated music, these models may no longer be suitable. When streaming platforms use AI to create music, they should ensure transparency regarding the data sources, so that artists, record companies, and regulators know how musical works are used for training. Additionally, government agencies and industry associations should establish standards for watermarking technology in AI-generated music, which would help prevent the unauthorized use of protected music. With digital watermarking in place, streaming platforms could fairly distribute royalties to artists and record companies based on the number of subscriptions and plays of AI-generated music.
In addition, the extensive adoption of AI-generated music also affects employment in the music industry. On one hand, music AI enables a broader range of individuals to engage in music creation. Even those without formal music training or access to costly equipment can produce high-quality music, thereby lowering the barrier to entry in the industry. On the other hand, the advancement of music AI could alter traditional roles in the music sector; some production work for film, television, and advertising background music might be replaced by AI.

4.2. Creativity of AI-Generated Music

The creativity demonstrated by AI in the field of music composition is notable. The following discusses the creativity of AI-generated music, the evaluation of that creativity, and the impact of AI on the artistic creativity of music.

4.2.1. Creativity Evaluation

Sarkar et al. defined creativity, based on their literature review, as encompassing not only the traits of novelty and value but also taking into account the creative process involved [108]. The question of whether AI-generated content can be deemed creative is a complex and contentious issue. According to the definition of computational creativity proposed by Colton et al., creativity is the capacity of a computer system to demonstrate behaviors that an unbiased observer considers as creative, although this does not necessarily imply that the system possesses actual creativity [109]. Therefore, it is necessary to evaluate the creativity of content generated by AI and to optimize the methods or processes according to the evaluation in order to enhance the creativity of the products.
Evaluating the creativity of AI-generated music is a complex and multidimensional challenge because it encompasses aspects such as artistic value, innovation, and diversity. At present, there are no universally accepted or precise evaluation methods for AI-generated music [110]. In practice, evaluations combine subjective and objective methods. Subjective evaluation primarily depends on user studies and questionnaire surveys that probe respondents’ aesthetic experiences and emotional reactions to music. However, this approach is resource-intensive, has limited repeatability, may be constrained by factors such as experimental design and sample selection, and does not allow for real-time evaluation. Objective evaluation uses quantifiable indicators derived from music theory and signal processing to assess the quality of the generated music, including loss functions, recall rates, and multiple music metrics (pitch, rhythm, chord/harmony, and style transfer) [111]. These quantitative indicators are objective, but they cannot completely replace human perception, have difficulty capturing the emotion, style, and artistry of music, and lack interpretability in terms of musical creativity. Some researchers have attempted to integrate subjective and objective evaluation methods for a more comprehensive assessment, but these approaches still cannot fully resolve the shortcomings of objective evaluation [112].
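To make the objective side concrete, the sketch below computes two symbolic-music indicators of the kind mentioned above, a pitch-class histogram and note density, from a toy note list. It is our own minimal illustration, not a metric prescribed in [111]; the `Note` class and function names are hypothetical.

```python
# Minimal sketch (illustration only): two objective symbolic-music metrics
# computed from a toy note list rather than a real MIDI parser.
from dataclasses import dataclass
from collections import Counter

@dataclass
class Note:
    pitch: int      # MIDI pitch number (0-127)
    start: float    # onset time in seconds
    end: float      # offset time in seconds

def pitch_class_histogram(notes):
    """Normalized distribution over the 12 pitch classes."""
    counts = Counter(n.pitch % 12 for n in notes)
    total = sum(counts.values()) or 1
    return [counts.get(pc, 0) / total for pc in range(12)]

def note_density(notes, window=1.0):
    """Average number of note onsets per `window` seconds."""
    if not notes:
        return 0.0
    duration = max(n.end for n in notes) - min(n.start for n in notes)
    return len(notes) / max(duration / window, 1e-9)

# Usage: compare a generated piece against a human reference by measuring the
# L1 distance between their pitch-class histograms.
generated = [Note(60, 0.0, 0.5), Note(64, 0.5, 1.0), Note(67, 1.0, 1.5)]
reference = [Note(60, 0.0, 1.0), Note(62, 1.0, 2.0), Note(64, 2.0, 3.0)]
dist = sum(abs(a - b) for a, b in zip(pitch_class_histogram(generated),
                                      pitch_class_histogram(reference)))
print(f"histogram L1 distance: {dist:.3f}, density: {note_density(generated):.2f}")
```

Such distances and densities are cheap and reproducible, but, as noted above, they say little about emotional or artistic quality on their own.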
Given these limitations, human physiological signals may be used in subjective music evaluation experiments to improve objectivity, real-time performance, and stability. As a non-invasive physiological signal, the electroencephalogram (EEG) can reveal complex and transient neural processes and has been widely used in research on human creativity [113] and emotion [114]. Photoplethysmography (PPG) is another non-invasive method that captures the expansion and contraction of blood vessels caused by heartbeats. The analysis of PPG signals is generally simpler than that of EEG in terms of acquisition, processing, feature extraction, and result interpretation, and PPG has been employed for emotion detection [115]. EEG and PPG can detect indicators of human physiological states accurately and in real time, making them valuable for providing more reliable experimental data while reducing the resource consumption of subjective evaluation experiments on AI-generated music.
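As an illustration of how such physiological indicators might be quantified, the following sketch estimates the power of the standard EEG frequency bands from a raw signal using Welch’s method. The band boundaries, sampling rate, and synthetic test signal are assumptions for demonstration only, not a validated evaluation protocol.

```python
# Minimal sketch (assumed band boundaries, 256 Hz sampling, synthetic data):
# estimating EEG band power with Welch's power spectral density.
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}   # Hz, illustrative boundaries

def band_powers(eeg, fs=256):
    """Return absolute power in each band for a 1-D EEG signal."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = np.trapz(psd[mask], freqs[mask])  # integrate PSD over the band
    return powers

# Usage with a synthetic 10 s signal containing a strong 10 Hz (alpha) component.
fs = 256
t = np.arange(0, 10, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(t.size)
print(band_powers(eeg, fs))
```

In an evaluation experiment, band powers recorded while listeners hear generated versus human-composed excerpts could complement questionnaire responses.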

4.2.2. Impact on Artistic Creativity and Expression

Employing AI to handle repetitive tasks such as mixing and mastering can free musicians from dull work, allowing them to concentrate on creative composition. Moreover, users without formal music training can create the music they desire by simply providing a text prompt or uploading a melody. Music AI significantly reduces the barriers to music creation and increases the overall production of musical works. However, while music AI has made it easier for more individuals to engage in music creation, it may also result in a large number of homogeneous works that lack innovation. Thus, alongside the democratization of the music creation process, it raises concerns about the mass production of low-quality music that lacks artistic value.
According to the AI music generation platform Boomy, more than 21 million songs had been generated as of June 2025, since its establishment in 2019 [116]. In contrast, Spotify maintains a catalog of some 100 million tracks from around the world, spanning historical to contemporary music [117]. Whether the quality of such a large volume of AI-generated music can match that of human artists within a short time remains an open question. After all, current AI music generation is based on past musical material, and as quantity grows rapidly, quality may regress to the mean. Although AI-generated music demonstrates impressive harmonic coherence and rhythmic appeal, it lacks the narrative style, emotional expression, and cultural context that drive human creativity [118]. If a huge number of works are generated by simply applying AI algorithms, we may fall into the trap of formulaic creation. This may lead to highly structured, algorithm-driven works that cater to public tastes but lack the rich emotions expressed by human artists.
Besides the limitations that algorithms place on the novelty of AI-generated music, users’ own expertise also leads to works of varying quality. Although AI technology has lowered the threshold for music creation, a lack of data literacy may create a digital divide, preventing some users from fully benefiting from digitization [119]. Furthermore, if users lack the necessary music theory knowledge and skills, they may find it difficult to create high-quality works even with advanced tools [120].
Music AI can bring about the democratization of music creation, but it will also inevitably lead to a large number of low-quality works. In addition to optimizing music generation models to produce more creative music, efforts can be directed toward two further aspects. First, the public’s musical literacy should be enhanced, which can not only cultivate music appreciation abilities but also equip potential users with the skills and understanding needed for high-level music creation. Second, collaborative approaches between humans and AI in music creation can be explored, using AI as an auxiliary tool to achieve sustainable development of the music production industry.

4.3. Emotional Expressiveness of AI-Generated Music

AI-generated music has advanced significantly in recent years and achieved notable success on commercial platforms. However, it continues to face challenges in emotional expressiveness, which remains a critical issue to be resolved. The Chinese “Records of the Grand Historian” [121] describes music as “All music originates from the human heart. The human heart is moved by things” and “Music is where sound originates. Its origin lies in the human heart and it is moved by things.” It is widely recognized that listeners are touched by musical pieces because the melodies or lyrics connect with their feelings. Beyond combinations of musical symbols, different instruments also convey different emotions. For instance, the flute may create a joyful atmosphere, while the xiao (Chinese vertical flute) conveys feelings of loneliness and sorrow. Even within the same instrument, variations in timbre, pitch, and tonal quality may evoke distinct emotional nuances.

4.3.1. Limitations and Improvements in Emotional Expressiveness

Although AI can imitate various musical styles and generate technically proficient works, these compositions often lack the emotional depth and uniqueness that human composers bring. Several factors contribute to this. Firstly, AI lacks embodiment and intrinsic feelings. Human emotional experience is closely linked to physiological and psychological states, and it is difficult for AI to simulate this complex internal process, which leads to superficial emotional expressiveness [122]. Secondly, there is the problem of subjective bias in AI music training. Different people may have varying emotional responses to the same piece of music in different situations. If data with emotional labels are directly mapped to music sequences for training AI models, the learning process may become confused, which can affect the consistency of emotions in the generated music [123]. A further reason is the lack of emotional information in existing music datasets, which results in insufficient emotion-related training [124,125]. Creating emotional music datasets is labor-intensive and time-consuming; in general, the emotional music datasets used for training AI are relatively small, and the emotional annotations are often coarse. This makes it difficult for AI to learn the complex relationships between emotions and musical elements in depth, resulting in generated music that lacks emotional depth.
Improving the emotional expressiveness of AI-generated music is an active research topic. Incorporating emotional theories into AI music generation is one effective approach. Kaitong Zheng et al. applied emotional theories from music psychology, selecting pitch histograms and note density as features representing tonality and rhythm, respectively, to control the emotional expression of the generated music [126]. Souraja Kundu et al. proposed an emotion-guided image-to-music generation system, which establishes an emotion space to generate music that matches the emotional characteristics of a given image [127]. The issue of subjective bias in training can be addressed by eliminating or reducing the subjective bias of emotional labels. Chenfei Kang et al. used musical attributes as a bridge between emotion and music; in the stage of mapping emotions to attributes, the attribute values near the cluster centers represent the general emotions of the samples, thereby mitigating the impact of individual differences in emotional labels [123]. Additionally, some researchers analyze EEG signals to identify listeners’ emotional states and generate more personalized and emotional music accordingly [128,129]. Tommaso Colafiglio et al. proposed a polyphonic music generation system called NeuralPMG, which integrates a brain–computer interface (BCI) with a finger-movement tracking device and can generate polyphonic music for users in different emotional states by combining EEG data and finger trajectory information [130]. With the continuous advancement of deep learning technologies and the fusion of multimodal data, music AI is expected to create works with stronger emotional expressiveness.
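To illustrate the general idea of attribute-based emotion control (a simplified reading of the approach in [126], not its actual implementation), the sketch below maps a target valence–arousal pair to target values of two controllable attributes, a major/minor tendency derived from the pitch histogram and a note density; the thresholds and the `AttributeTargets` class are our own assumptions.

```python
# Simplified sketch of attribute-based emotion control (illustration only).
# A valence-arousal pair is mapped to target musical attributes; a generator
# would then be conditioned on, or its outputs filtered by, these targets.
from dataclasses import dataclass

@dataclass
class AttributeTargets:
    prefer_major: bool       # tonality proxy derived from the pitch histogram
    notes_per_second: float  # rhythm proxy (note density)

def emotion_to_attributes(valence: float, arousal: float) -> AttributeTargets:
    """valence/arousal in [-1, 1]; mapping and thresholds are illustrative assumptions."""
    return AttributeTargets(
        prefer_major=valence >= 0.0,                      # positive valence -> major tendency
        notes_per_second=2.0 + 4.0 * (arousal + 1) / 2,   # higher arousal -> denser rhythm
    )

# Example: a "tense/negative" target emotion (low valence, high arousal).
print(emotion_to_attributes(valence=-0.7, arousal=0.8))
# -> minor tendency with roughly 5.6 notes per second
```

The appeal of such attribute bridges is that the emotion label never touches the music tokens directly, which is precisely how the cited work reduces the effect of subjective labeling differences.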

4.3.2. A Multimodal Signal-Driven Emotional Music Generation System

Currently, while some progress has been achieved in using EEG to generate emotional music, there are still several limitations [128,129,130,131]. For instance, the correspondence between EEG signals and musical elements is limited. EEG signals are characterized by high dynamism and complexity, with significant individual differences, and they can be influenced by various external factors such as the subject’s mental state and environmental conditions. Additionally, since musical emotion is multidimensional, accurately establishing a reliable correspondence between EEG and musical emotions in experiments often leads to interpretability challenges.
In consideration of the limitations mentioned above, this paper proposes a multimodal emotional music generation system that combines EEG signals with eye tracking (ET), based on the authors’ previous research [132], which showed that this multimodal approach can effectively assess students’ attention. ET analysis assesses a person’s emotional state by measuring eye movements, including pupil diameter, pupil position, fixation duration, and eye movement speed. Studies have shown that ET analysis can identify at least the six basic emotions proposed by Ekman: anger, disgust, fear, happiness, sadness, and surprise [133,134]. ET offers better signal stability and real-time performance than EEG, and it helps eliminate artifacts caused by blinking in EEG detection. Therefore, combining EEG with ET can enhance the accuracy and stability of user emotion recognition, facilitating the generation of music with rich emotional depth. Since both EEG and ET signals are continuous time series, a Transformer model, which is well suited to processing long sequences, is adopted. The proposed structure of this multimodal system is shown in Figure 2.
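For illustration, the sketch below extracts a few basic ET features of the kind listed above (mean pupil diameter, fixation count, and mean gaze speed) from raw gaze samples. The speed threshold separating fixations from saccades and the synthetic data are assumptions for demonstration, not part of the proposed system’s specification.

```python
# Minimal sketch (illustrative threshold and feature set) of extracting basic
# eye-tracking features from raw gaze samples (x, y in degrees, pupil in mm).
import numpy as np

def et_features(x, y, pupil, fs=120, speed_thresh=30.0):
    """speed_thresh (deg/s) separates fixations from saccades; the value is an assumption."""
    dx, dy = np.diff(x), np.diff(y)
    speed = np.hypot(dx, dy) * fs                 # angular gaze speed in deg/s
    fixating = speed < speed_thresh               # True while the eye is roughly still
    # Count fixation episodes as runs of consecutive below-threshold samples.
    fixation_count = int(np.sum(np.diff(fixating.astype(int)) == 1)) + int(fixating[0])
    return {
        "mean_pupil_mm": float(np.mean(pupil)),
        "fixation_count": fixation_count,
        "mean_speed_deg_s": float(np.mean(speed)),
    }

# Usage with 2 s of synthetic gaze data at 120 Hz containing one gaze shift.
fs = 120
t = np.arange(0, 2, 1 / fs)
x = np.where(t < 1.0, 0.0, 5.0) + 0.05 * np.random.randn(t.size)
y = 0.05 * np.random.randn(t.size)
pupil = 3.5 + 0.1 * np.random.randn(t.size)
print(et_features(x, y, pupil, fs))
```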
In the experiment, videos depicting different emotions are played to the user, and the EEG and ET signals are collected synchronously. EEG signal preprocessing is performed to eliminate artifacts and extract emotion-related features such as the power of the δ, θ, α, β, and γ frequency bands. At the same time, the users’ ET signals are acquired with eye trackers, and signal processing is carried out to extract emotion-related features from the ET signals. Subsequently, the EEG and ET feature signals are fused into multimodal data. As EEG and ET signals are continuous, the fused emotion feature signals are segmented into time windows; this process also involves encoding and quantifying the signals to generate discrete EEG/ET tokens. On the other hand, the music data are processed to extract features such as pitch and rhythm and to annotate emotional labels, which are then encoded into music tokens. In the Transformer model, the self-attention mechanism captures complex dependencies among tokens and extracts features related to the interaction between EEG/ET and musical emotions, while the cross-attention mechanism maps the emotional elements from EEG/ET to music, leading to the production of emotional music. After training, the Transformer model can generate emotional music from the EEG and ET signals produced by subjects watching different types of videos.
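A minimal architectural sketch of this token-level fusion is given below, assuming hypothetical vocabulary sizes and model dimensions and using a standard encoder–decoder Transformer in PyTorch. It illustrates the self-/cross-attention roles described above; it is not a trained or validated implementation of the proposed system, and positional encodings are omitted for brevity.

```python
# Minimal architectural sketch (assumed vocabulary sizes and dimensions) of the
# proposed EEG/ET-to-music mapping. Discretized EEG/ET tokens are encoded; music
# tokens are decoded with cross-attention to the physiological representation.
import torch
import torch.nn as nn

class EEGETMusicTransformer(nn.Module):
    def __init__(self, signal_vocab=512, music_vocab=1024, d_model=256,
                 nhead=8, num_layers=4):
        super().__init__()
        self.signal_embed = nn.Embedding(signal_vocab, d_model)  # fused EEG/ET tokens
        self.music_embed = nn.Embedding(music_vocab, d_model)    # music tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, music_vocab)
        # Note: positional encodings are omitted here for brevity.

    def forward(self, signal_tokens, music_tokens):
        # Self-attention models dependencies within each sequence; the decoder's
        # cross-attention maps emotional features of the EEG/ET sequence onto music.
        src = self.signal_embed(signal_tokens)
        tgt = self.music_embed(music_tokens)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.head(out)   # next-token logits over the music vocabulary

# Shape check: batch of 2, 128 fused EEG/ET tokens, 64 music tokens generated so far.
model = EEGETMusicTransformer()
logits = model(torch.randint(0, 512, (2, 128)), torch.randint(0, 1024, (2, 64)))
print(logits.shape)  # torch.Size([2, 64, 1024])
```

During training, the music tokens would be teacher-forced against annotated emotional music; at inference, music tokens are sampled autoregressively conditioned on the subject’s EEG/ET tokens.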
The advantage of this system is that it incorporates ET signals into existing EEG-driven emotional music generation methods, enhancing the interpretability of the relationship between EEG signals and musical emotional expression. This multimodal data fusion approach is expected to improve the emotional expression capabilities of AI-generated music.

4.4. Other Challenges of AI-Generated Music

Music, as an art form that has continuously evolved alongside human history, has a complex and profound theoretical foundation, diverse styles and forms, and is closely connected to cultures around the world. Therefore, compared to other AI-generated content, AI-generated music faces more challenges.

4.4.1. Potential Training Bias

Like other AI-generated content [135], AI-generated music faces the problem of training bias. In this context, training bias refers to imbalanced or skewed results produced by models due to limitations in the training data or biases in algorithm design. If the training dataset consists primarily of mainstream genres such as Western classical or pop music, the AI model cannot effectively learn and generate music beyond the dataset. As a result, minority music genres are not fully learned or used by the model, limiting the diversity of generated music [136].
In 2024, Atharva Mehta et al. analyzed datasets totaling over one million hours of music used for training AI models, along with more than 200 academic papers. The analysis showed that approximately 86% of the total dataset duration and over 93% of researchers focused primarily on music from the Global North, while genres from the Global South accounted for only 14.6% of the dataset [137]. This concentration of research data, combined with the limited variety of musical genres, hinders the diversity of global music. Additionally, on streaming platforms, AI algorithms record users’ listening habits and recommend new music accordingly. Although such recommendation broadens the range of available music, it also leads to a homogenization of the promoted music: algorithms tend to recommend tracks similar to those the user has already listened to, limiting listeners’ exposure to new genres and emerging artists. In some cases, this creates a feedback loop in which mainstream and commercial music dominates playlists, making it harder for experimental or niche music to gain exposure. Consequently, the diversity of music creation may be negatively impacted.
To address the issue of training bias, in addition to gathering more music data from diverse cultures and styles to enhance data variety, other methods can be utilized during training. These approaches include using synthetic data [138], ensemble learning and regularized fine-tuning [139], and gradient-aware learning methods [140].
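As a simple, generic illustration of counteracting genre imbalance during training (not one of the specific methods cited above), the sketch below computes inverse-frequency sampling weights so that under-represented genres are drawn more often; the genre labels and counts are hypothetical.

```python
# Generic illustration of rebalancing a genre-skewed training set with
# inverse-frequency sampling weights (hypothetical data; not one of the
# cited mitigation methods).
from collections import Counter

def inverse_frequency_weights(genre_labels):
    """Weight each sample by 1 / (count of its genre), normalized to sum to 1."""
    counts = Counter(genre_labels)
    raw = [1.0 / counts[g] for g in genre_labels]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical, heavily skewed dataset: 6 pop tracks, 1 gamelan track, 1 maqam track.
labels = ["pop"] * 6 + ["gamelan", "maqam"]
weights = inverse_frequency_weights(labels)
print([round(w, 3) for w in weights])
# Each pop track gets weight ~0.056, while the gamelan and maqam tracks each
# get ~0.333, so minority genres are sampled far more often per track.
```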

4.4.2. Difficulty in Creation

Compared to other applications of AI-generated content, such as text or images, which primarily require an understanding of semantics or visual features, music creation involves a more complex set of elements. Music is not merely an arrangement of notes but a multi-level structural expression, covering rich elements such as pitch, rhythm, harmony, and melody, as well as higher-level passages and movements. Additionally, music theory comprises many abstract concepts, such as tonality, chord function, and counterpoint. Furthermore, different musical styles have their own theoretical and structural characteristics: classical music emphasizes harmonic rigor and structural symmetry, whereas jazz focuses more on improvisation and rhythmic complexity. For these reasons, although AI can learn the rules and patterns of music, it may struggle to apply them as flexibly and innovatively as an experienced human composer. In the future, it will be necessary to integrate music theory knowledge into deep learning models and to develop algorithms capable of automatically analyzing and understanding musical structure in order to improve the quality of generated music.

4.5. Limitations of This Survey

This review aims to provide a comprehensive overview of the current development, applications, and impacts of music generation technology based on existing published studies. In adherence to the ethical guidelines and standards for review articles, we have excluded all unpublished or proprietary data, including experimental results of new music generation systems. In addition, because the field is evolving rapidly, this survey is limited by the availability of the existing literature and the defined scope of the investigation, and may therefore not fully reflect the most recent advancements. We encourage readers to consult emerging studies for timely updates. Future work will focus on optimizing the proposed emotional music generation system, with experimental results to be disseminated through peer-reviewed publications.

5. Conclusions

Driven by technological progress, music and AI have become deeply integrated as music AI, and music generation is one of its representative fields. This paper surveys the development of rule-based systems, Markov chains, evolutionary computation, and deep learning methods in music generation. Recent progress in both research and commercial products shows the power of deep learning methods. The applications and challenges of music AI in music generation and composition, music rehabilitation and therapy, and music education and learning are also surveyed. The role of music AI is undergoing a profound change: from an initial auxiliary tool to today’s co-creator, it is playing an increasingly important role in all areas of music.
Although music AI has become an important driving force in the development of the music industry, it also raises philosophical and ethical questions about authorship and originality. In addition, the lack of emotional depth and insufficient creativity also restrict the artistic value of musical works. Therefore, currently, music AI cannot replace human composers and serves only as a collaborator. With the ongoing development of technology, more advanced algorithm models and innovative theories will be employed in music AI. The potential applications of music AI will become broader than ever before.

Author Contributions

Conceptualization, L.W. and Y.Y.; methodology, Y.Q. and S.Z.; writing—original draft preparation, L.W.; writing—review and editing, Y.Y., Y.Q. and S.Z.; project administration, S.Z.; funding acquisition, Y.Y. and S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the “Intelligent Assessment and Visualization of Learning Stress Based on EEG Rhythms and Eye Movement Signal Fusion,” a 2023 Higher Education Research Planning Project by the China Association of Higher Education (23XXK0402); “Enhancing Teaching Competence of University Faculty in the Context of Artificial Intelligence” Ministry of Education Industry–University Cooperative Education Program (230903879055748); “Innovative Experimental Project Based on Adaptive Closed-Loop Neuromodulation System”, a 2025 Sichuan Provincial Innovative Experimental Project for Undergraduate Universities (143); “Application of ’EEG Data + Artificial Intelligence’ in Comprehensive Evaluation of Higher Education Teaching,” funded by the Sichuan Network Culture Research Center, a Key Research Base for Social Sciences in Sichuan Province (WLWH23-21); “Evaluation of Higher Mathematics Teaching Models and Student Attention Based on Multimodal Information Fusion,” a Key Project of the Neijiang Municipal Philosophy and Social Sciences Planning Project (NJ2024ZD014); “Development Pathways for Smart Education in Local Universities in the Era of Artificial Intelligence,” by the Sichuan Research Center for Educational Informatization Application and Development (JYXX23-013); and “AI-Enabled Teaching Reform Project,” a university-level educational research project at Neijiang Normal University (JG202413).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, W.; Shen, L.; Huang, C.F.; Lee, J.; Zhao, X. Development Status, Frontier Hotspots, and Technical Evaluations in the Field of AI Music Composition Since the 21st Century: A Systematic Review. IEEE Access 2024, 12, 89452–89466. [Google Scholar] [CrossRef]
  2. Miranda, E.R. Handbook of Artificial Intelligence for Music; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
  3. Briot, J.P.; Hadjeres, G.; Pachet, F.D. Deep Learning Techniques for Music Generation; Springer: Berlin/Heidelberg, Germany, 2020; Volume 1. [Google Scholar]
  4. Afchar, D.; Melchiorre, A.; Schedl, M.; Hennequin, R.; Epure, E.; Moussallam, M. Explainability in music recommender systems. AI Mag. 2022, 43, 190–208. [Google Scholar] [CrossRef]
  5. Messingschlager, T.V.; Appel, M. Mind ascribed to AI and the appreciation of AI-generated art. New Media Soc. 2023, 27, 1673–1692. [Google Scholar] [CrossRef]
  6. Williams, D.; Hodge, V.J.; Wu, C.Y. On the use of ai for generation of functional music to improve mental health. Front. Artif. Intell. 2020, 3, 497864. [Google Scholar] [CrossRef]
  7. Shen, L.; Zhang, H.; Zhu, C.; Li, R.; Qian, K.; Meng, W.; Tian, F.; Hu, B.; Schuller, B.W.; Yamamoto, Y. A First Look at Generative Artificial Intelligence Based Music Therapy for Mental Disorders. IEEE Trans. Consum. Electron. 2024; early access. [Google Scholar] [CrossRef]
  8. Wu, J.; Ji, Z.; Li, P. C2-MAGIC: Chord-Controllable Multi-track Accompaniment Generation with Interpretability and Creativity. In Summit on Music Intelligence; Springer: Berlin/Heidelberg, Germany, 2023; pp. 108–121. [Google Scholar]
  9. Cope, D. Experiments in Musical Intelligence; A-R Editions: Madison, WI, USA, 1996. [Google Scholar]
  10. Payne, C. MuseNet. OpenAI Blog 2019, 3. Available online: https://openai.com/index/musenet/ (accessed on 3 March 2025).
  11. Briot, J.P.; Pachet, F. Deep learning for music generation: Challenges and directions. Neural Comput. Appl. 2020, 32, 981–993. [Google Scholar] [CrossRef]
  12. Drott, E. Copyright, compensation, and commons in the music AI industry. Creat. Ind. J. 2021, 14, 190–207. [Google Scholar] [CrossRef]
  13. Pachet, F.; Roy, P.; Carré, B. Assisted music creation with flow machines: Towards new categories of new. In Handbook of Artificial Intelligence for Music: Foundations, Advanced Approaches, and Developments for Creativity; Springer: Berlin/Heidelberg, Germany, 2021; pp. 485–520. [Google Scholar]
  14. Schäfer, T.; Sedlmeier, P.; Städtler, C.; Huron, D. The psychological functions of music listening. Front. Psychol. 2013, 4, 511. [Google Scholar] [CrossRef]
  15. Wang, Z.; Liu, H.; Yu, J.; Zhang, T.; Liu, Y.; Zhang, K. MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation. arXiv 2024, arXiv:2407.03188. [Google Scholar]
  16. Supper, M. A few remarks on algorithmic composition. Comput. Music J. 2001, 25, 48–53. [Google Scholar] [CrossRef]
  17. Tanaka, T.; Furukawa, K. Automatic melodic grammar generation for polyphonic music using a classifier system. In Proceedings of the SMC Conferences, Seoul, Republic of Korea, 14–17 October 2012; pp. 150–156. [Google Scholar]
  18. Pachet, F. Interacting with a musical learning system: The continuator. In Proceedings of the International Conference on Music and Artificial Intelligence, Edinburgh, UK, 12–14 September 2002; Springer: Berlin/Heidelberg, Germany, 2002; pp. 119–132. [Google Scholar]
  19. Pachet, F.; Roy, P. Markov constraints: Steerable generation of Markov sequences. Constraints 2011, 16, 148–172. [Google Scholar] [CrossRef]
  20. Papadopoulos, A.; Roy, P.; Pachet, F. Assisted lead sheet composition using flowcomposer. In Proceedings of the Principles and Practice of Constraint Programming: 22nd International Conference, CP 2016, Toulouse, France, 5–9 September 2016; Proceedings 22. Springer: Berlin/Heidelberg, Germany, 2016; pp. 769–785. [Google Scholar]
  21. Huang, C.F.; Lian, Y.S.; Nien, W.P.; Chieng, W.H. Analyzing the perception of Chinese melodic imagery and its application to automated composition. Multimed. Tools Appl. 2016, 75, 7631–7654. [Google Scholar] [CrossRef]
  22. Simon, I.; Morris, D.; Basu, S. MySong: Automatic accompaniment generation for vocal melodies. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Florence, Italy, 5–10 April 2008; pp. 725–734. [Google Scholar]
  23. Williams, D.; Hodge, V.J.; Gega, L.; Murphy, D.; Cowling, P.I.; Drachen, A. AI and automatic music generation for mindfulness. In Proceedings of the 2019 AES International Conference on Immersive and Interactive Audio: Creating the Next Dimension of Sound Experience, York, UK, 27–29 March 2019. [Google Scholar]
  24. Horner, A.; Goldberg, D.E. Genetic Algorithms and Computer-Assisted Music Composition; Michigan Publishing, University of Michigan Library: Ann Arbor, MI, USA, 1991; Volume 51. [Google Scholar]
  25. Biles, J. GenJam: A genetic algorithm for generating jazz solos. In Proceedings of the ICMC, Aarhus, Denmark, 12–17 September 1994; ICMC: Ann Arbor, MI, USA, 1994; Volume 94, pp. 131–137. [Google Scholar]
  26. Dahlstedt, P.; Nordahl, M.G. Living melodies: Coevolution of sonic communication. Leonardo 2001, 34, 243–248. [Google Scholar] [CrossRef]
  27. Scirea, M.; Togelius, J.; Eklund, P.; Risi, S. Affective evolutionary music composition with MetaCompose. Genet. Program. Evolvable Mach. 2017, 18, 433–465. [Google Scholar] [CrossRef]
  28. De Prisco, R.; Zaccagnino, G.; Zaccagnino, R. EvoComposer: An evolutionary algorithm for 4-voice music compositions. Evol. Comput. 2020, 28, 489–530. [Google Scholar] [CrossRef]
  29. Todd, P.M. A connectionist approach to algorithmic composition. Comput. Music J. 1989, 13, 27–43. [Google Scholar] [CrossRef]
  30. Mozer, M.C. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connect. Sci. 1994, 6, 247–280. [Google Scholar] [CrossRef]
  31. Eck, D.; Schmidhuber, J. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, Martigny, Switzerland, 6 September 2002; IEEE: Piscataway, NJ, USA, 2002; pp. 747–756. [Google Scholar]
  32. Ji, S.; Yang, X.; Luo, J. A survey on deep learning for symbolic music generation: Representations, algorithms, evaluations, and challenges. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  33. Hadjeres, G.; Pachet, F.; Nielsen, F. Deepbach: A steerable model for bach chorales generation. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1362–1371. [Google Scholar]
  34. Wu, J.; Hu, C.; Wang, Y.; Hu, X.; Zhu, J. A hierarchical recurrent neural network for symbolic melody generation. IEEE Trans. Cybern. 2019, 50, 2749–2757. [Google Scholar] [CrossRef]
  35. Yang, L.C.; Chou, S.Y.; Yang, Y.H. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv 2017, arXiv:1703.10847. [Google Scholar]
  36. Dong, H.W.; Hsiao, W.Y.; Yang, L.C.; Yang, Y.H. Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  37. Trieu, N.; Keller, R. JazzGAN: Improvising with generative adversarial networks. In Proceedings of the MUME Workshop, Salamanca, Spain, 25–26 June 2018. [Google Scholar]
  38. Donahue, C.; McAuley, J.; Puckette, M. Adversarial audio synthesis. arXiv 2018, arXiv:1802.04208. [Google Scholar]
  39. Engel, J.; Agrawal, K.K.; Chen, S.; Gulrajani, I.; Donahue, C.; Roberts, A. Gansynth: Adversarial neural audio synthesis. arXiv 2019, arXiv:1902.08710. [Google Scholar] [CrossRef]
  40. Brunner, G.; Konrad, A.; Wang, Y.; Wattenhofer, R. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. arXiv 2018, arXiv:1809.07600. [Google Scholar] [CrossRef]
  41. Roberts, A.; Engel, J.; Raffel, C.; Hawthorne, C.; Eck, D. A hierarchical latent vector model for learning long-term structure in music. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4364–4373. [Google Scholar]
  42. Luo, J.; Yang, X.; Ji, S.; Li, J. MG-VAE: Deep Chinese folk songs generation with specific regional styles. In Proceedings of the 7th Conference on Sound and Music Technology (CSMT) Revised Selected Papers, Harbin, China, 26–29 December 2019; Springer: Singapore, 2020; pp. 93–106. [Google Scholar]
  43. Wang, Z.; Zhang, Y.; Zhang, Y.; Jiang, J.; Yang, R.; Zhao, J.; Xia, G. Pianotree vae: Structured representation learning for polyphonic music. arXiv 2020, arXiv:2008.07118. [Google Scholar] [CrossRef]
  44. Huang, C.Z.A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck, D. Music Transformer: Generating Music with Long-Term Structure. arXiv 2018, arXiv:1809.04281. [Google Scholar]
  45. Dhariwal, P.; Jun, H.; Payne, C.; Kim, J.W.; Radford, A.; Sutskever, I. Jukebox: A generative model for music. arXiv 2020, arXiv:2005.00341. [Google Scholar] [CrossRef]
  46. Agostinelli, A.; Denk, T.I.; Borsos, Z.; Engel, J.; Verzetti, M.; Caillon, A.; Huang, Q.; Jansen, A.; Roberts, A.; Tagliasacchi, M.; et al. Musiclm: Generating music from text. arXiv 2023, arXiv:2301.11325. [Google Scholar] [CrossRef]
  47. Copet, J.; Kreuk, F.; Gat, I.; Remez, T.; Kant, D.; Synnaeve, G.; Adi, Y.; Défossez, A. Simple and controllable music generation. Adv. Neural Inf. Process. Syst. 2023, 36, 47704–47720. [Google Scholar]
  48. Forsgren, S.; Martiros, H. Riffusion-Stable Diffusion for Real-Time Music Generation. 2022. Available online: https://riffusion.com (accessed on 3 May 2025).
  49. Huang, Q.; Park, D.S.; Wang, T.; Denk, T.I.; Ly, A.; Chen, N.; Zhang, Z.; Zhang, Z.; Yu, J.; Frank, C.; et al. Noise2music: Text-conditioned music generation with diffusion models. arXiv 2023, arXiv:2302.03917. [Google Scholar]
  50. Schneider, F.; Kamal, O.; Jin, Z.; Schölkopf, B. Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion. arXiv 2023, arXiv:2301.11757. [Google Scholar]
  51. Ning, Z.; Chen, H.; Jiang, Y.; Hao, C.; Ma, G.; Wang, S.; Yao, J.; Xie, L. DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion. arXiv 2025, arXiv:2503.01183. [Google Scholar]
  52. Yu, J.; Wu, S.; Lu, G.; Li, Z.; Zhou, L.; Zhang, K. Suno: Potential, prospects, and trends. Front. Inf. Technol. Electron. Eng. 2024, 25, 1025–1030. [Google Scholar] [CrossRef]
  53. Friberg, A. Generative rules for music performance: A formal description of a rule system. Comput. Music J. 1991, 15, 56–71. [Google Scholar] [CrossRef]
  54. Liu, C.H.; Ting, C.K. Computational intelligence in music composition: A survey. IEEE Trans. Emerg. Top. Comput. Intell. 2016, 1, 2–15. [Google Scholar] [CrossRef]
  55. Roads, C.; Mathews, M. Interview with max mathews. Comput. Music J. 1980, 4, 15–22. [Google Scholar] [CrossRef]
  56. Chomsky, N. Logical structure in language. J. Am. Soc. Inf. Sci. 1957, 8, 284. [Google Scholar] [CrossRef]
  57. Lindenmayer, A. Mathematical models for cellular interactions in development I. Filaments with one-sided inputs. J. Theor. Biol. 1968, 18, 280–299. [Google Scholar] [CrossRef] [PubMed]
  58. Prusinkiewicz, P. Score generation with L-systems. In Proceedings of the ICMC, Den Haag, The Netherlands, 20–24 October 1986; ICMC: Ann Arbor, MI, USA, 1986; pp. 455–457. [Google Scholar]
  59. Pachet, F. Creativity studies and musical interaction. In Musical Creativity; Psychology Press: London, UK, 2006; pp. 363–374. [Google Scholar]
  60. Wang, L.; Zhao, Z.; Liu, H.; Pang, J.; Qin, Y.; Wu, Q. A review of intelligent music generation systems. Neural Comput. Appl. 2024, 36, 6381–6401. [Google Scholar] [CrossRef]
  61. Sandred, O.; Laurson, M.; Kuuskankare, M. Revisiting the Illiac Suite–a rule-based approach to stochastic processes. Sonic Ideas/Ideas Sonicas 2009, 2, 42–46. [Google Scholar]
  62. Thyer, M.; Kuczera, G. Modeling long-term persistence in hydroclimatic time series using a hidden state Markov model. Water Resour. Res. 2000, 36, 3301–3310. [Google Scholar] [CrossRef]
  63. Meyn, S.P.; Tweedie, R.L. Markov Chains and Stochastic Stability; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  64. Mor, B.; Garhwal, S.; Kumar, A. A systematic review of hidden Markov models and their applications. Arch. Comput. Methods Eng. 2021, 28, 1429–1448. [Google Scholar] [CrossRef]
  65. Back, T.; Hammel, U.; Schwefel, H.P. Evolutionary computation: Comments on the history and current state. IEEE Trans. Evol. Comput. 1997, 1, 3–17. [Google Scholar] [CrossRef]
  66. Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence; MIT Press: Cambridge, MA, USA, 1992. [Google Scholar]
  67. Berry, R.; Dahlstedt, P. Artificial Life: Why should musicians bother? Contemp. Music Rev. 2003, 22, 57–67. [Google Scholar] [CrossRef]
  68. Eck, D.; Schmidhuber, J. A first look at music composition using lstm recurrent neural networks. Ist. Dalle Molle Studi Sull Intell. Artif. 2002, 103, 48–56. [Google Scholar]
  69. Hochreiter, S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness-Knowl.-Based Syst. 1998, 6, 107–116. [Google Scholar] [CrossRef]
  70. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  71. Bisharad, D.; Laskar, R.H. Music genre recognition using convolutional recurrent neural network architecture. Expert Syst. 2019, 36, e12429. [Google Scholar] [CrossRef]
  72. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Proceedings of the NIPS 2014, Montreal, QC, Canada, 8–13 December 2014; NeurIPS, 2014; Available online: https://proceedings.neurips.cc/paper_files/paper/2014/file/f033ed80deb0234979a61f95710dbe25-Paper.pdf (accessed on 22 April 2025).
  73. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  74. Simon, I.; Roberts, A.; Raffel, C.; Engel, J.; Hawthorne, C.; Eck, D. Learning a latent space of multitrack measures. arXiv 2018, arXiv:1806.00195. [Google Scholar] [CrossRef]
  75. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017; NeurIPS, 2017; Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 22 April 2025).
  76. Zhang, N. Learning adversarial transformer for symbolic music generation. IEEE Trans. Neural Netw. Learn. Syst. 2020, 34, 1754–1763. [Google Scholar] [CrossRef]
  77. Jiang, J.; Xia, G.G.; Carlton, D.B.; Anderson, C.N.; Miyakawa, R.H. Transformer vae: A hierarchical model for structure-aware and interpretable music representation learning. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 516–520. [Google Scholar]
  78. Wang, W.; Li, J.; Li, Y.; Xing, X. Style-conditioned music generation with Transformer-GANs. Front. Inf. Technol. Electron. Eng. 2024, 25, 106–120. [Google Scholar] [CrossRef]
  79. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2256–2265. [Google Scholar]
  80. Han, S. AI, culture industries and entertainment. In The Routledge Social Science Handbook of AI; Routledge: Boca Raton, FL, USA, 2021; pp. 295–312. [Google Scholar]
  81. Huang, C.Z.A.; Hawthorne, C.; Roberts, A.; Dinculescu, M.; Wexler, J.; Hong, L.; Howcroft, J. The bach doodle: Approachable music composition with machine learning at scale. arXiv 2019, arXiv:1907.06637. [Google Scholar] [CrossRef]
  82. Zulić, H. How AI can change/improve/influence music composition, performance and education: Three case studies. INSAM J. Contemp. Music. Art Technol. 2019, 100–114. [Google Scholar] [CrossRef]
  83. Behr, A. Now and Then: Enabled by AI–Created by Profound Connections Between the Four Beatles. The Conversation. 2023. Available online: https://theconversation.com/now-and-then-enabled-by-ai-created-by-profound-connections-between-the-four-beatles-216920 (accessed on 15 April 2025).
  84. Thaut, M.H.; McIntosh, G.C.; Hoemberg, V. Neurobiological foundations of neurologic music therapy: Rhythmic entrainment and the motor system. Front. Psychol. 2015, 5, 1185. [Google Scholar] [CrossRef]
  85. Damm, L.; Varoqui, D.; De Cock, V.C.; Dalla Bella, S.; Bardy, B. Why do we move to the beat? A multi-scale approach, from physical principles to brain dynamics. Neurosci. Biobehav. Rev. 2020, 112, 553–584. [Google Scholar] [CrossRef] [PubMed]
  86. Marquez-Garcia, A.V.; Magnuson, J.; Morris, J.; Iarocci, G.; Doesburg, S.; Moreno, S. Music therapy in autism spectrum disorder: A systematic review. Rev. J. Autism Dev. Disord. 2022, 9, 91–107. [Google Scholar] [CrossRef]
  87. Eftychios, A.; Nektarios, S.; Nikoleta, G. Alzheimer disease and music-therapy: An interesting therapeutic challenge and proposal. Adv. Alzheimer’s Dis. 2021, 10, 1–18. [Google Scholar] [CrossRef]
  88. Tang, H.; Chen, L.; Wang, Y.; Zhang, Y.; Yang, N.; Yang, N. The efficacy of music therapy to relieve pain, anxiety, and promote sleep quality, in patients with small cell lung cancer receiving platinum-based chemotherapy. Support. Care Cancer 2021, 29, 7299–7306. [Google Scholar] [CrossRef]
  89. Chomiak, T.; Sidhu, A.S.; Watts, A.; Su, L.; Graham, B.; Wu, J.; Classen, S.; Falter, B.; Hu, B. Development and validation of ambulosono: A wearable sensor for bio-feedback rehabilitation training. Sensors 2019, 19, 686. [Google Scholar] [CrossRef]
  90. Wittwer, J.E.; Winbolt, M.; Morris, M.E. Home-based gait training using rhythmic auditory cues in Alzheimer’s disease: Feasibility and outcomes. Front. Med. 2020, 6, 335. [Google Scholar] [CrossRef]
  91. Gonzalez-Hoelling, S.; Bertran-Noguer, C.; Reig-Garcia, G.; Suñer-Soler, R. Effects of a music-based rhythmic auditory stimulation on gait and balance in subacute stroke. Int. J. Environ. Res. Public Health 2021, 18, 2032. [Google Scholar] [CrossRef]
  92. Grau-Sánchez, J.; Münte, T.F.; Altenmüller, E.; Duarte, E.; Rodríguez-Fornells, A. Potential benefits of music playing in stroke upper limb motor rehabilitation. Neurosci. Biobehav. Rev. 2020, 112, 585–599. [Google Scholar] [CrossRef]
  93. Ghai, S.; Maso, F.D.; Ogourtsova, T.; Porxas, A.X.; Villeneuve, M.; Penhune, V.; Boudrias, M.H.; Baillet, S.; Lamontagne, A. Neurophysiological changes induced by music-supported therapy for recovering upper extremity function after stroke: A case series. Brain Sci. 2021, 11, 666. [Google Scholar] [CrossRef]
  94. Dogruoz Karatekin, B.; Icagasioglu, A. The effect of therapeutic instrumental music performance method on upper extremity functions in adolescent cerebral palsy. Acta Neurol. Belg. 2021, 121, 1179–1189. [Google Scholar] [CrossRef]
  95. Mina, F.; Darweesh, M.E.S.; Khattab, A.N.; Serag, S.M. Role and efficacy of music therapy in learning disability: A systematic review. Egypt. J. Otolaryngol. 2021, 37, 1–12. [Google Scholar] [CrossRef]
  96. Butzlaff, R. Can music be used to teach reading? J. Aesthetic Educ. 2000, 34, 167–178. [Google Scholar] [CrossRef]
  97. Okely, J.A.; Overy, K.; Deary, I.J. Experience of playing a musical instrument and lifetime change in general cognitive ability: Evidence from the lothian birth cohort 1936. Psychol. Sci. 2022, 33, 1495–1508. [Google Scholar] [CrossRef] [PubMed]
  98. Morrison, B. Millions of U.S. Students Denied Access to Music Education, According to First-Ever National Study—Arts Education Data Project—artseddata.org. Available online: https://artseddata.org/millions-of-u-s-students-denied-access-to-music-education-according-to-first-ever-national-study/ (accessed on 26 March 2025).
  99. Ruth, N.; Müllensiefen, D. Survival of musical activities. When do young people stop making music? PLoS ONE 2021, 16, e0259105. [Google Scholar] [CrossRef] [PubMed]
  100. Wei, J.; Karuppiah, M.; Prathik, A. College music education and teaching based on AI techniques. Comput. Electr. Eng. 2022, 100, 107851. [Google Scholar] [CrossRef]
  101. Wu, Q. The application of artificial intelligence in music education management: Opportunities and challenges. J. Comput. Methods Sci. Eng. 2024, 25, 2836–2848. [Google Scholar] [CrossRef]
  102. Li, S. The impact of AI-driven music production software on the economics of the music industry. Inf. Dev. 2025, 02666669241312170. [Google Scholar] [CrossRef]
  103. Alexander, A. “Heart on My Sleeve”: An AI-Created Hit Song Mimicking Drake and The Weeknd Goes Viral; SAGE Business Cases Originals; SAGE Publications: Thousand Oaks, CA, USA, 2024. [Google Scholar]
  104. Suno. Do I Have the Copyrights to Songs I Made? 2025. Available online: https://help.suno.com/en/articles/2746945 (accessed on 16 June 2025).
  105. Gruetzemacher, R.; Whittlestone, J. The transformative potential of artificial intelligence. Futures 2022, 135, 102884. [Google Scholar] [CrossRef]
  106. Lee, E. Prompting progress: Authorship in the age of AI. Fla. Law Rev. 2024, 76, 1445. [Google Scholar] [CrossRef]
  107. Steele, A. Universal, Warner and Sony Are Negotiating AI Licensing Rights for Music. 2025. Available online: https://www.wsj.com/business/media/ai-music-licensing-universal-warner-sony-92bcbc0d (accessed on 16 June 2025).
  108. Sarkar, P.; Chakrabarti, A. Studying engineering design creativity-developing a common definition and associated measures. In Proceedings of the NSF International Workshop on Studying Design Creativity’08, Aix-en-Provence, France, 10–11 March 2008. [Google Scholar]
  109. Colton, S.; Wiggins, G.A. Computational creativity: The final frontier? In Proceedings of the ECAI 2012, Montpellier, France, 27–31 August 2012; IOS Press: Amsterdam, The Netherlands, 2012; pp. 21–26. [Google Scholar]
  110. Carnovalini, F.; Rodà, A. Computational creativity and music generation systems: An introduction to the state of the art. Front. Artif. Intell. 2020, 3, 14. [Google Scholar] [CrossRef]
  111. Ji, S.; Luo, J.; Yang, X. A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions. arXiv 2020, arXiv:2011.06801. [Google Scholar] [CrossRef]
  112. Huang, J.; Wang, J.C.; Smith, J.B.; Song, X.; Wang, Y. Modeling the compatibility of stem tracks to generate music mashups. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 187–195. [Google Scholar]
  113. Stevens Jr, C.E.; Zabelina, D.L. Creativity comes in waves: An EEG-focused exploration of the creative brain. Curr. Opin. Behav. Sci. 2019, 27, 154–162. [Google Scholar] [CrossRef]
  114. Samal, P.; Hashmi, M.F. Role of machine learning and deep learning techniques in EEG-based BCI emotion recognition system: A review. Artif. Intell. Rev. 2024, 57, 50. [Google Scholar] [CrossRef]
  115. Lee, M.S.; Lee, Y.K.; Pae, D.S.; Lim, M.T.; Kim, D.W.; Kang, T.K. Fast emotion recognition based on single pulse PPG signal with convolutional neural network. Appl. Sci. 2019, 9, 3355. [Google Scholar] [CrossRef]
116. Boomy Corporation. Unleash Your Creativity. Make Music with Boomy AI. 2025. Available online: https://boomy.com/ (accessed on 19 June 2025).
  117. Spotify. About Spotify. 2025. Available online: https://newsroom.spotify.com/company-info/ (accessed on 19 June 2025).
  118. Lecamwasam, K.; Chaudhuri, T.R. Exploring listeners’ perceptions of AI-generated and human-composed music for functional emotional applications. arXiv 2025, arXiv:2506.02856. [Google Scholar]
  119. Hagen, A.N. Datafication, literacy, and democratization in the music industry. Pop. Music Soc. 2022, 45, 184–201. [Google Scholar] [CrossRef]
  120. Weng, S.S.; Chen, H.C. Exploring the role of deep learning technology in the sustainable development of the music production industry. Sustainability 2020, 12, 625. [Google Scholar] [CrossRef]
  121. Qian, S.; Watson, B. Records of the Grand Historian; Columbia University Press: New York, NY, USA, 1993; Volume 1. [Google Scholar]
122. Novelli, N.; Proksch, S. Am I (deep) blue? Music-making AI and emotional awareness. Front. Neurorobot. 2022, 16, 897110. [Google Scholar] [CrossRef]
  123. Kang, C.; Lu, P.; Yu, B.; Tan, X.; Ye, W.; Zhang, S.; Bian, J. EmoGen: Eliminating subjective bias in emotional music generation. arXiv 2023, arXiv:2307.01229. [Google Scholar] [CrossRef]
124. Ji, S.; Yang, X. EmoMusicTV: Emotion-conditioned symbolic music generation with hierarchical transformer VAE. IEEE Trans. Multimed. 2023, 26, 1076–1088. [Google Scholar] [CrossRef]
  125. Yao, W.; Chen, C.P.; Zhang, Z.; Zhang, T. AE-AMT: Attribute-Enhanced Affective Music Generation with Compound Word Representation. IEEE Trans. Comput. Soc. Syst. 2024, 12, 890–904. [Google Scholar] [CrossRef]
  126. Zheng, K.; Meng, R.; Zheng, C.; Li, X.; Sang, J.; Cai, J.; Wang, J.; Wang, X. EmotionBox: A music-element-driven emotional music generation system based on music psychology. Front. Psychol. 2022, 13, 841926. [Google Scholar] [CrossRef] [PubMed]
  127. Kundu, S.; Singh, S.; Iwahori, Y. Emotion-Guided Image to Music Generation. arXiv 2024, arXiv:2410.22299. [Google Scholar] [CrossRef]
  128. Gomez-Morales, O.; Perez-Nastar, H.; Álvarez-Meza, A.M.; Torres-Cardona, H.; Castellanos-Dominguez, G. EEG-Based Music Emotion Prediction Using Supervised Feature Extraction for MIDI Generation. Sensors 2025, 25, 1471. [Google Scholar] [CrossRef] [PubMed]
  129. Ran, S.; Zhong, W.; Ma, L.; Duan, D.; Ye, L.; Zhang, Q. Mind to Music: An EEG Signal-Driven Real-Time Emotional Music Generation System. Int. J. Intell. Syst. 2024, 2024, 9618884. [Google Scholar] [CrossRef]
130. Colafiglio, T.; Ardito, C.; Sorino, P.; Lofù, D.; Festa, F.; Di Noia, T.; Di Sciascio, E. NeuralPMG: A neural polyphonic music generation system based on machine learning algorithms. Cogn. Comput. 2024, 16, 2779–2802. [Google Scholar] [CrossRef]
  131. Jiang, H.; Chen, Y.; Wu, D.; Yan, J. EEG-driven automatic generation of emotive music based on transformer. Front. Neurorobot. 2024, 18, 1437737. [Google Scholar] [CrossRef]
  132. Qin, Y.; Yang, J.; Zhang, M.; Zhang, M.; Kuang, J.; Yu, Y.; Zhang, S. Construction of a Quality Evaluation System for University Course Teaching Based on Multimodal Brain Data. Recent Patents Eng. 2025; in press. [Google Scholar]
  133. Lim, J.Z.; Mountstephens, J.; Teo, J. Emotion recognition using eye-tracking: Taxonomy, review and current challenges. Sensors 2020, 20, 2384. [Google Scholar] [CrossRef]
134. Ekman, P. Basic Emotions. In Handbook of Cognition and Emotion; Dalgleish, T., Power, M., Eds.; John Wiley & Sons: Chichester, UK, 1999. [Google Scholar]
  135. Kim, B.; Kim, H.; Kim, K.; Kim, S.; Kim, J. Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9012–9020. [Google Scholar]
  136. Bryan-Kinns, N.; Li, Z. Reducing Barriers to the Use of Marginalised Music Genres in AI. arXiv 2024, arXiv:2407.13439. [Google Scholar] [CrossRef]
  137. Mehta, A.; Chauhan, S.; Choudhury, M. Missing Melodies: AI Music Generation and its “Nearly” Complete Omission of the Global South. arXiv 2024, arXiv:2412.04100. [Google Scholar] [CrossRef]
  138. Shahul Hameed, M.A.; Qureshi, A.M.; Kaushik, A. Bias mitigation via synthetic data generation: A review. Electronics 2024, 13, 3909. [Google Scholar] [CrossRef]
  139. Radwan, A.; Zaafarani, L.; Abudawood, J.; AlZahrani, F.; Fourati, F. Addressing bias through ensemble learning and regularized fine-tuning. arXiv 2024, arXiv:2402.00910. [Google Scholar] [CrossRef]
  140. Zhang, S.; Zhu, C.; Li, H.; Cai, J.; Yang, L. Gradient-aware learning for joint biases: Label noise and class imbalance. Neural Netw. 2024, 171, 374–382. [Google Scholar] [CrossRef]
Figure 1. The roadmap of music generation with music AI [3,9,10,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52].
Figure 2. The structure of the EEG and eye-tracking signal-driven emotional music generation system.
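Figure 2 only names the building blocks of the proposed system, so the following minimal Python sketch is included purely as an illustration of how EEG-derived affect and eye-tracking cues might be fused into a valence–arousal estimate that then conditions coarse generation parameters. Every function name, weight, and mapping here is an assumption for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class EmotionEstimate:
    valence: float  # -1 (negative) .. +1 (positive)
    arousal: float  # -1 (calm)     .. +1 (excited)


def fuse_modalities(eeg_valence, eeg_arousal, pupil_dilation, fixation_rate,
                    eeg_weight=0.7):
    """Illustrative late fusion of EEG-derived affect with eye-tracking cues."""
    # Treat larger pupil dilation and faster fixation rates as signs of higher activation.
    eye_arousal = max(-1.0, min(1.0, (2.0 * pupil_dilation - 1.0) + 0.2 * (fixation_rate - 2.0)))
    arousal = eeg_weight * eeg_arousal + (1.0 - eeg_weight) * eye_arousal
    valence = eeg_valence  # assume valence is estimated from EEG alone
    return EmotionEstimate(max(-1.0, min(1.0, valence)), max(-1.0, min(1.0, arousal)))


def music_parameters(emotion):
    """Map the fused valence-arousal estimate to coarse generation controls."""
    return {
        "mode": "major" if emotion.valence >= 0 else "minor",
        "tempo_bpm": int(70 + 60 * (emotion.arousal + 1) / 2),  # 70-130 BPM
        "velocity": int(60 + 40 * (emotion.arousal + 1) / 2),   # MIDI velocity 60-100
    }


estimate = fuse_modalities(eeg_valence=0.4, eeg_arousal=0.2,
                           pupil_dilation=0.8, fixation_rate=2.5)
print(music_parameters(estimate))
```

In practice, the EEG affect estimates would come from trained classifiers such as those reviewed in [114,128,129,131] rather than being supplied directly, and the output parameters would condition a generative model instead of being printed.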
Table 1. Comparison of models for music generation.
Rule-based. Advantages: comprehensible to humans, can produce large amounts of music, no training required. Challenges: lacks creativity and flexibility, poor versatility and scalability. Scenarios: music in fixed styles, music education.
Markov chain. Advantages: easy to control, allows adding constraints. Challenges: has difficulty capturing long-term structure, tends to repeat melodies from the corpus. Scenarios: game background music, repetitive electronic music clips.
Evolutionary computation. Advantages: adapts well to feedback and evaluation. Challenges: time-consuming, quality depends on the fitness function. Scenarios: interactive music, dynamic game soundtracks, improvised music.
RNN. Advantages: strong at capturing temporal structure in music. Challenges: quality degrades over long sequences, prone to over-fitting, limited diversity. Scenarios: short or simple melodies in real time, classical music.
GAN. Advantages: high-quality music, diverse and unique styles in real time. Challenges: heavy computing resources and training time, unstable training. Scenarios: complex multi-track music, electronic music.
VAE. Advantages: diverse and complex musical pieces, flexible and controllable, fast training. Challenges: output can lack sharpness and detail, relatively complicated training. Scenarios: diverse and complex pieces, interpolation between styles.
Transformer. Advantages: excellent at capturing long-range dependencies and complex musical structures. Challenges: requires significant computational resources and large amounts of data. Scenarios: music with complex arrangements and long sequences, multi-track composition.
Diffusion models. Advantages: high-quality audio, captures complex audio characteristics, stable training. Challenges: long training and generation times, high computational cost. Scenarios: high-fidelity music, professional audio.
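To make the differences in Table 1 more concrete, the minimal Python sketch below implements the simplest entry, a first-order Markov chain over MIDI pitches. The toy corpus and function names are illustrative assumptions, not code from any of the surveyed systems.

```python
import random
from collections import defaultdict


def train_markov(melodies):
    """Count pitch-to-pitch transitions for a first-order Markov chain."""
    transitions = defaultdict(list)
    for melody in melodies:
        for current, nxt in zip(melody, melody[1:]):
            transitions[current].append(nxt)
    return transitions


def generate(transitions, start, length=16, seed=None):
    """Random-walk over the learned transitions to produce a new melody."""
    rng = random.Random(seed)
    melody = [start]
    for _ in range(length - 1):
        candidates = transitions.get(melody[-1]) or [start]  # dead end: restart from the seed note
        melody.append(rng.choice(candidates))
    return melody


# Toy corpus: two short phrases as MIDI pitch numbers (60 = middle C).
corpus = [
    [60, 62, 64, 65, 67, 65, 64, 62, 60],
    [60, 64, 67, 72, 67, 64, 60],
]

model = train_markov(corpus)
print(generate(model, start=60, seed=42))
```

Because every generated note is drawn only from transitions observed in the corpus, such a model is easy to constrain but quickly echoes its training melodies, which is precisely the limitation listed in the table; the neural models further down trade this transparency for learned long-range structure at a much higher computational cost.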
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
