Article

Perspectives on Generative Sound Design: A Generative Soundscapes Showcase

Grzegorz Samson
Faculty of Composition, Theory of Music and Sound Engineering, The Feliks Nowowiejski Academy of Music in Bydgoszcz, Słowackiego 7, 85-008 Bydgoszcz, Poland
Arts 2025, 14(3), 67; https://doi.org/10.3390/arts14030067
Submission received: 19 April 2025 / Revised: 30 May 2025 / Accepted: 3 June 2025 / Published: 12 June 2025
(This article belongs to the Special Issue Sound, Space, and Creativity in Performing Arts)

Abstract

Recent advancements in generative neural networks, particularly transformer-based models, have introduced novel possibilities for sound design. This study explores the use of generative pre-trained transformers (GPT) to create complex, multilayered soundscapes from textual and visual prompts. A custom pipeline is proposed, featuring modules for converting the source input into structured sound descriptions and subsequently generating cohesive auditory outputs. As a complementary solution, a granular synthesizer prototype was developed to enhance the usability of generative audio samples by enabling their recombination into seamless and non-repetitive soundscapes. The integration of GPT models with granular synthesis demonstrates significant potential for innovative audio production, paving the way for advancements in professional sound-design workflows and immersive audio applications.

1. Introduction

1.1. Transformers in Media Generation

The presence of media created by artificial neural networks, often labeled as “Artificial Intelligence” (AI), is becoming more prevalent in many areas of daily life. It has reached a level where it can both deceive and captivate human audiences. This is particularly evident in creative domains, where AI’s ability to mimic human artistry continues to blur the lines between machine- and human-generated content. A recent study in Nature revealed that AI-generated poetry is becoming increasingly indistinguishable from human work (Porter and Machery 2024). However, AI has the potential to expand existing media by introducing new, contextually dependent material, enriching both the content and the experience it offers (Samson 2024).

1.2. Historical Context

1.2.1. The Beginnings of Generative Music

Throughout the history of music, composition has involved meticulous preplanning and the crafting of fixed sequences of notes, with classical composers like Bach and Beethoven establishing structured frameworks that combined careful planning with space for interpretation (Taruskin and Gibbs 2013). Bach’s basso continuo notation, for instance, provided a harmonic foundation while allowing performers to interpret and elaborate on the given material, leading to subtle variations between performances. Similarly, dynamics, phrasing, and ornamentation were shaped by a performer’s individual style, introducing an element of “human randomness” into otherwise structured compositions.
An early example of deliberately incorporating stochastic processes into music can be found in Mozart’s Musikalisches Würfelspiel (Musical Dice Game). This compositional tool uses dice rolls to assemble measures of music from predefined options, creating a unique piece with each play. While Mozart provided the building blocks, the element of chance determined the final outcome, blending deterministic composition with probabilistic execution, an approach that continues to inspire modern composers (Wayne 2023).

1.2.2. Stochastic Soundscapes Origins

Outside the musical tradition, generative soundscapes have a rich historical precedent. Listening to natural phenomena, such as wind, waves, or rain, inspires a form of sonic awareness rooted in change and irregularity. In many cultures, wind chimes or water-driven instruments were created to channel these forces into sound-producing structures. These devices exemplify an early form of procedural sound design, where their creators set physical parameters, but the output was determined by environmental variability, such as wind speed or water flow (Toop 1995; Schafer 1993).
One notable example is the aeolian harp, a stringed instrument played by the wind, which dates back to ancient Greece and was later popularized in the Romantic era (Blesser and Salter 2007). Its ethereal tones, shaped by airflow rather than human agency, represent the early fusion of natural indeterminacy with intentional sonic construction. Similarly, sea organs, such as the modern Sea Organ in Zadar, Croatia, transform the irregular force of ocean waves into evolving harmonic textures by channeling air through tuned pipes (Stamać 2005).

1.2.3. Analogue Electronics for Generative Sound

The advent of analogue technology in the 20th century expanded the boundaries of sound creation. Composers such as Edgard Varèse and Karlheinz Stockhausen began experimenting with electronic sound generators, tape manipulation, and feedback systems (Griffiths 2010). These innovations introduced the concept of semi-automated sound creation, wherein the composer defined a system or process but left room for variability within its execution.
This era marked a philosophical shift, as composers like John Cage embraced indeterminacy, surrendering control over certain musical parameters to allow systems or performers to introduce randomness. Iannis Xenakis expanded this approach by applying mathematical models, such as probability theory and Markov chains, to shape musical structures (Xenakis and Kanach 1992).
However, this creative explosion would not have been possible without the parallel innovations of electronic engineers. The development of modular synthesizers, pioneered by figures such as Robert Moog and Don Buchla, brought voltage-controlled oscillators, filters, envelope generators, and operational amplifiers into the hands of artists, giving them even more space for creative exploration, parametrization, and randomization (Pinch and Trocco 2009).

1.2.4. Digital Electronics for Generative Sound

With the transition to digital technology in the late 20th century, the potential for computationally driven sound design has expanded exponentially. Digital synthesizers, samplers, and music software provide unprecedented precision and flexibility. At the same time, composers began exploring stochastic and algorithmic composition methods, using probability and randomness as creative tools. Visual programming environments, like Max/MSP and Pure Data, allow composers to build custom generative audio systems (Holmes 2020).

1.2.5. Transformers for Generative Sound Design

The current era of generative audio is driven by machine learning advancements. The use of generative neural networks for sound design is a dynamically developing field of research (Ji et al. 2020; Koutini et al. 2021). Owing to advances in Generative Pre-trained Transformer (GPT) networks (Huang et al. 2023b; Kreuk et al. 2023; Le et al. 2023; Verma and Berger 2021; Verma and Chafe 2021; Vyas et al. 2023), it has become possible to create complex soundscapes based on textual and visual prompts in an automated manner. As the potential of new generative technologies grows, there is a need to understand how sound designers can effectively use these new technologies and evaluate their capabilities (Ashvala and Lerch 2022; Oh et al. 2023).
The aim of this article is to explore how generative transformers can be used to create multilayered generative soundscapes, showcasing their potential in the field of sound design.

2. Related Work

2.1. A Definition of Soundscape

A soundscape refers to the acoustic environment as perceived or experienced by an individual, encompassing all audible components, both natural and man-made. For example, a forest soundscape may include the rustling of leaves, chirping of birds, and distant running water, while an urban soundscape may consist of traffic noise, human chatter, and construction sounds. Coined by composer and researcher R. Murray Schafer, the concept plays a central role in his theory of Acoustic Ecology, which studies the relationship between humans and their sonic environments (Schafer 1993). Schafer introduced key distinctions within soundscapes, such as keynotes (background sounds), signals (foreground sounds demanding attention), and soundmarks (unique sounds with cultural or geographical significance). These elements highlight how soundscapes shape our perception of the world and influence our interactions with the environment.
In Acoustic Ecology, the quality of a soundscape is also evaluated, with an emphasis on preserving “hi-fi” environments—characterized by clarity and low ambient noise, such as in rural countryside—over “lo-fi” environments, where excessive noise masks sonic details. The soundscape concept extends beyond ecological concerns, influencing contemporary sound design and composition, where it is used to create immersive auditory experiences that reflect, interpret, and transform real-world sonic environments.

2.2. Generative Soundscapes

Research on generative soundscapes has a substantial history, with previous studies exploring diverse technologies and methodologies for creating dynamic auditory environments. A generative soundscape creation study (Birchfield et al. 2005) introduced a framework for real-time soundscape generation rooted in the principles of Acoustic Ecology. This approach used keynotes and signals from various locations as foundational elements, dynamically managed through a probabilistic system implemented in Max/MSP that adapted to user behavior and interaction. A later study (Eigenfeldt and Pasquier 2011) explored the use of autonomous artificial agents in soundscape composition. These agents analyzed the spectral content and metadata from a large sound database to create soundscapes with minimal spectral overlap, showcasing the potential of metadata-driven autonomous systems.
Further advancements in soundscape modeling and simulation (Koutsomichalis and Valle 2014) introduced a system capable of modeling three-dimensional soundscapes using generative algorithms. This system supports real-time user interaction and complex spatial configurations, thereby expanding the possibilities for dynamic and interactive soundscapes. A more recent study (Zhuang et al. 2024) demonstrated the integration of soundscapes with visual elements using generative AI by translating semantic audio vectors into street-view images.

2.3. Generative Audio Transformers

Several leading companies have proposed solutions for generating sound using transformer models. Many solutions focus on speech synthesis (Chen et al. 2021; Chen et al. 2022; Le et al. 2023; Rubenstein et al. 2023; Zhang et al. 2023) or generating music-like content (Agostinelli et al. 2023; Huang et al. 2023a; Rouard et al. 2024), with prominent examples including Meta’s MusicGen (Copet et al. 2023) and consumer-grade platforms Udio and Suno (Yu et al. 2024).
Models with broader capabilities, such as Meta’s Audiobox (Vyas et al. 2023), are also being developed and published, particularly in academic research (Evans et al. 2024; Huang et al. 2023b; Liang et al. 2024; Liu et al. 2023a, 2023b; van den Oord et al. 2016). However, these models often lack significant market visibility despite their technical sophistication. Recent studies have also explored how to apply these transformer models in various domains, such as games (Marrinan et al. 2024), as creative assistants (Liu et al. 2023b; Tur 2024), or for music understanding (Li et al. 2024) and description (Qingqing et al. 2022).
Eleven Labs stands out as one of the leading companies in the generative audio field, developing transformers fine-tuned for different audio synthesis applications, including speech, music, and sound effects.

2.4. Generative Audio for Video

Eleven Labs provides prototype GitHub repositories featuring their models, including a tool for generating soundscapes for video clips (Eleven Labs 2024a). The version referenced here corresponds to commit 78c1c9d, dated 4 November 2024. This web-based application, hosted locally by the user via Node.js, communicates with the Eleven Labs API to generate cohesive audio tracks for video segments. The generated soundscapes are contextually relevant to the video frames; however, the system exhibits some latency while extracting four frames per second. Additionally, it does not support the generation of separate audio stems for multilayered soundscapes.

2.5. Ethical Discourse

An important consideration in the application of artificial neural network technologies is the ethical implications of their use (Barnett 2023). The fair use of audio samples and the responsibility of model creators to source training data ethically and legally are significant issues in the discourse on the development and application of such tools.
More ethical challenges emerge in the domain of generative sound. For instance, when AI models replicate the stylistic elements or voices of real composers, musicians, or sound designers, the boundaries of artistic authorship and identity become blurred. Similarly, the use of AI-generated audio in contexts such as media, advertising, and video games may obscure the origin and authorship of content, leading to reduced accountability, as seen in other media domains (Ray 2023).
Another concern is bias in the training data, which may result in unrepresentative outputs or the perpetuation of cultural stereotypes. Deepfake audio technology further complicates ethical considerations because it enables the synthesis of realistic yet deceptive content, potentially leading to misinformation or the erosion of trust in recorded media.
While this article focuses on the technical and creative potential of generative transformers for sound design, further research is needed to establish ethical guidelines and verification frameworks for sound-generation systems that ensure fairness, consent, and accountability throughout their lifecycle.

3. Methodology

3.1. Main Assumptions

This study employs a methodology centered on designing a pipeline that generates multilayered soundtracks from a prompting medium. Textual or visual information is processed into layers of sound messages that are directly or indirectly related to the semantic content of the source material. These layers are then evaluated through empirical research to assess the effectiveness and coherence of the generated soundscapes.
Two sonification paradigms guided this approach:
  • Direct Translation: A sound is directly mapped to a concept, where its meaning aligns literally with the source (e.g., forest sounds generated from text or images depicting a forest).
  • Free Interpretation: Sounds represent a concept indirectly using illustrative elements (e.g., fatigue expressed through rhythmic breathing, yawning, or the ticking of a clock).

3.2. GPT Feedback

An essential component of this pipeline is the use of a Generative Pre-trained Transformer (GPT), particularly a large language model (LLM), to generate and refine prompts that serve as inputs for another GPT model. This iterative process uses the capabilities of one GPT to enhance the effectiveness of the prompts used by the subsequent GPT, resulting in higher-quality and more contextually accurate soundscapes.
This methodology draws on the principles of prompt engineering, emphasizing the creation of targeted inputs that optimize the performance of the generative models. The approach resembles a “collaborative network”, where multiple models cooperate in synergy to enhance results. It also shares conceptual similarities with Generative Adversarial Networks (GANs), in which networks are designed to compete, but here they collaborate to achieve refined results.
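As a minimal sketch of this feedback idea, the following Python fragment chains two chat-completion calls with the official openai SDK; the function names, prompt wording, and model choice are illustrative assumptions rather than the study's exact implementation.

```python
# Minimal sketch of chained GPT calls (one model drafting, one refining),
# assuming the official openai SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def draft_sound_prompts(concept: str) -> str:
    """First pass: ask the LLM to propose sound layers for a concept."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a sound designer assistant."},
            {"role": "user", "content": f"Propose 5-8 sound layers for: {concept}"},
        ],
    )
    return response.choices[0].message.content

def refine_sound_prompts(draft: str) -> str:
    """Second pass: feed the draft back in and tighten it into generator-ready prompts."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rewrite each sound layer as a concise, concrete text-to-audio prompt."},
            {"role": "user", "content": draft},
        ],
    )
    return response.choices[0].message.content

refined = refine_sound_prompts(draft_sound_prompts("medieval castle courtyard at dawn"))
print(refined)
```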

4. Pipeline Implementation

4.1. The Pipeline Concept

The pipeline, illustrated in Figure 1, consists of complementary modules designed to transform the input content into sound descriptions and, ultimately, into soundscapes. The process can begin with either textual or visual input. A source-to-sound-prompt module processes the text or visual input into sound descriptions, which are then passed to a sound-prompt-to-sound module. This module converts descriptions into actual soundscapes.

4.2. Applied Technologies

The pipeline was implemented in Python (version 3.10) and integrates external APIs, relying on widely used consumer-grade GPT models for its core functionality; an illustrative request sketch follows the list of components below.
  • Eleven Labs API: A sound effects generator that enables users to create sound by providing descriptive prompts. This API uses natural language understanding to interpret user input and produce high-quality, context-specific sound effects (Eleven Labs 2024b).
  • OpenAI ChatGPT: A multimodal large language model (LLM) capable of handling both text and image inputs. The implementation utilized the gpt-4o-mini version (18 July 2024), with additional testing conducted using gpt-4o (6 August 2024) (OpenAI 2024).
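The following hedged sketch shows how a single sound-effect request to the Eleven Labs API might look using the requests library; the endpoint path and field names reflect the publicly documented sound-generation endpoint at the time of writing and should be treated as assumptions, not the paper's exact code.

```python
import os
import requests

# Assumed endpoint for the Eleven Labs sound-effects generator; field names
# mirror the public documentation at the time of writing and may change.
API_URL = "https://api.elevenlabs.io/v1/sound-generation"

def generate_sfx(prompt: str, duration_s: float, prompt_influence: float, out_path: str) -> None:
    """Request one sound-effect clip for a descriptive prompt and save it as MP3."""
    response = requests.post(
        API_URL,
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={
            "text": prompt,
            "duration_seconds": duration_s,        # capped by the API (about 20 s)
            "prompt_influence": prompt_influence,  # literal vs. creative adherence
        },
        timeout=120,
    )
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)

generate_sfx("gentle rain falling on broad forest leaves", 10.0, 0.7, "rain_on_leaves_01.mp3")
```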

4.3. The Source-to-Sound-Prompt Module

The source-to-sound-prompt module processes text or image input (provided via a URL for images) using the ChatGPT model. The model was instructed to act as a sound designer assistant and generate detailed sound prompts.
Initially, ensuring consistency and reliability in the generated outputs posed challenges due to the generative and probabilistic nature of GPT models. Unlike explicitly defined algorithms with deterministic outputs, pre-trained transformer models produce unique and context-dependent responses, limited only by token and context window constraints.
To address these challenges and achieve consistency, a JSON schema was implemented to standardize the output format; a hedged sketch of such a schema follows the list below. Each sound layer is defined by the following attributes:
  • Prompt: A descriptive text specifying the auditory characteristics of the sound layer.
  • Category: A folder structure categorizing sound layers based on general characteristics and specific concepts (e.g., “General_Tag\Concept_Name”).
  • Filename: A descriptive identifier for each sound layer formatted as “snake_case”.
  • Duration: Length of the audio segment for each sound layer.
  • Num_generations: Number of versions of the sound layer to be generated.
  • Prompt_influence: A value that determines the extent to which the descriptive prompt influences the generated sound, allowing fine-tuning of the balance between creative interpretation and literal adherence to the prompt.
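A minimal schema sketch covering the attributes above, in the shape accepted by OpenAI's Structured Outputs response_format, might look as follows; the exact schema used in the study may differ in naming and constraints.

```python
# Hedged sketch of a JSON schema matching the attributes listed above; it can be
# passed as response_format={"type": "json_schema", "json_schema": SOUND_LAYER_SCHEMA}.
SOUND_LAYER_SCHEMA = {
    "name": "soundscape_layers",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "layers": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "prompt": {"type": "string"},            # descriptive text for the layer
                        "category": {"type": "string"},          # e.g. "General_Tag\\Concept_Name"
                        "filename": {"type": "string"},          # snake_case identifier
                        "duration": {"type": "number"},          # segment length in seconds
                        "num_generations": {"type": "integer"},  # versions to generate
                        "prompt_influence": {"type": "number"},  # literal vs. creative adherence
                    },
                    "required": ["prompt", "category", "filename",
                                 "duration", "num_generations", "prompt_influence"],
                    "additionalProperties": False,
                },
            }
        },
        "required": ["layers"],
        "additionalProperties": False,
    },
}
```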

4.4. Custom Instruction

A tailored custom instruction was provided to the GPT model to optimize its role as a sound design assistant. The instruction tasks the model with generating a JSON file that represents sound layers matching the auditory characteristics of a given concept. This process involves the following steps (an illustrative paraphrase of the instruction appears after the list):
  • Decomposing the concept into distinct sound elements (e.g., natural sounds, ambiance, and human activity).
  • Generating 5–8 sound layers, each accompanied by a detailed description.
  • Populating the metadata fields for each sound layer.
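An illustrative paraphrase of such an instruction, written as a Python system-prompt string, is shown below; the wording is an assumption, not the instruction used in the study.

```python
# Hypothetical system prompt in the spirit of the custom instruction described above.
SOUND_DESIGNER_INSTRUCTION = (
    "You are a sound designer assistant. Given a concept, decompose it into "
    "distinct sound elements (natural sounds, ambiance, human activity) and "
    "return a JSON object with 5-8 sound layers. For each layer provide: "
    "prompt, category (General_Tag\\Concept_Name), filename (snake_case), "
    "duration in seconds, num_generations, and prompt_influence between 0 and 1."
)
```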

4.5. Data, Supplementary Materials, and Code

The complete pipeline implementation, including the source code, JSON schema samples, and generated audio examples, is available on the Open Science Framework repository (Open Science Framework 2024a).

5. Preliminary Testing and Empirical Research

Initial testing and empirical research yielded preliminary conclusions and observations regarding the proposed solution.

5.1. Highlights

The proposed pipeline effectively facilitates the generation of multilayered soundscapes by combining atmospheric backgrounds with distinct point sounds. This approach has demonstrated substantial potential for creating complex auditory representations of specific locations or conceptual ideas, offering valuable tools for sound designers.
An example of the generated output is shown in Figure 2, which presents a sample generative soundscape pack featuring forest sounds, visualized as waveforms with corresponding name tags. The figure illustrates the arrangement of atmospheric and point sound layers in a 10-second soundscape.

5.2. Limitations

While the solution shows significant promise, several limitations were identified during the testing process.

5.2.1. Overlapping Frequencies

The model often generates soundscapes with layers that span the entire frequency spectrum, thereby resulting in overlapping atmospheric layers. This overlap can cause excessive noise, particularly in the higher frequency ranges, and reduce the clarity of the combined soundscape. Ideally, each atmospheric layer should occupy a specific segment of the frequency spectrum to avoid masking effects and to maintain a balanced auditory experience.
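One possible post-processing mitigation, not part of the proposed pipeline, is to confine each atmospheric layer to its own frequency band before mixing; a minimal scipy sketch of such band-limiting is shown below, with illustrative band edges.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_limit(samples: np.ndarray, sample_rate: int, low_hz: float, high_hz: float) -> np.ndarray:
    """Zero-phase band-pass filtering of a mono float signal to reduce masking between layers."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, samples)

# Illustrative use: keep a wind bed below the band occupied by birdsong.
# wind_filtered = band_limit(wind, 44100, 60.0, 1200.0)
```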

5.2.2. Abrupt Endings Without Proper Attenuation

Some generated sound samples ended abruptly without fade-outs or attenuation, creating a jarring effect that disrupted the natural flow of the soundscape. This lack of smooth transitions reduces immersion and highlights the need for automated fade-in and fade-out mechanisms to ensure professional-quality outputs.
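A minimal numpy sketch of the kind of automated fade-out called for here is shown below; it is illustrative post-processing applied to a generated sample, not a feature of the generation API.

```python
import numpy as np

def apply_fade_out(samples: np.ndarray, sample_rate: int, fade_seconds: float = 0.5) -> np.ndarray:
    """Attenuate the last `fade_seconds` of a mono float signal linearly to zero."""
    n = min(len(samples), int(fade_seconds * sample_rate))
    out = samples.astype(np.float32).copy()
    if n > 0:
        out[-n:] *= np.linspace(1.0, 0.0, n, dtype=np.float32)
    return out
```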

5.2.3. Unintended Sound Artifacts

The model also exhibited issues with adequately representing the intended sounds and generated unwanted artifacts at the end of some samples. During the initial tests, Eleven Labs demonstrated better performance when prototype instructions generated by gpt-4o-2024-08-06 were used, suggesting that more sophisticated models or refined instructions may improve the fidelity of the generated outputs. When the process was automated using the gpt-4o-mini-2024-07-18 version, certain distortions were observed in some cases. Prominent examples of these distortions, such as unwanted signals in the output sound files, are discussed below.

5.2.4. Misinterpreting Prompts by Focusing on Keywords

A problem was identified with the test file “footsteps_on_cobblestones_01.mp3”, which was generated using the following prompt: “The sound of soft footfalls on cobblestone paths can be heard as visitors explore the grounds. The subtle crunch of footsteps adds to the immersive experience of the castle surroundings”.
The generated sound unexpectedly featured a human voice delivered in a radio-style commentary, as though narrating a sporting event. This issue may be related to the transformer misinterpreting “footfalls” as “footballs”, leading to a representation resembling a sports-related context.
Transformers are capable of recognizing letter patterns even in misspelled words, allowing them to detect meaning despite errors. However, this tendency can have negative consequences when a correctly spelled word is misinterpreted as a similar but incorrect term, leading to misinformation, which is one of the universal challenges of using GPT models (Bubeck et al. 2023; Ray 2023).

5.2.5. Generating Random Musical Elements

A similar issue occurred with the test file “horse_and_carriage_01.mp3”, which was generated using the following prompt: “Occasionally, the distant sound of a horse and carriage rolling along a nearby lane fades in, evoking images of the past and adding to the overall historical charm of the castle setting”.
The generated sound included unexpected musical elements, seemingly designed to evoke an emotional or illustrative effect rather than accurately representing the sound of horses and carriages. This indicates that the model, likely trained on a large dataset containing musical content, occasionally defaults to illustrative reasoning, producing music-like content intended to evoke the scene described rather than faithfully replicating the literal auditory input. Keywords such as “fades in”, “evoking images”, and “charm”, which lack literal sonification equivalents, may have been interpreted as descriptors in a musical context, guiding the model toward a more musically oriented approach.

5.3. Possible Improvements

To address these challenges, more control over the model output is essential. For instance, incorporating parameters for exclusion directives (e.g., “no” or “do not include”) could allow users to specify elements to omit, thereby improving the precision and relevance of the results. Similar features have been implemented in tools like Midjourney, where users can explicitly control unwanted elements in the generated images (Midjourney 2024).

5.4. The Necessity of Human Verification and Refinement

Human oversight is one of the main strategies to ensure the production of high-quality materials in automated systems incorporating transformer models (Edwards et al. 2020; Ouyang et al. 2022). While the pipeline successfully produces complex, multilayered soundscapes, its probabilistic nature can lead to errors or inconsistencies. Manual refinement ensures alignment with the intended concept, resolves deviations and enhances the overall quality of the final soundscape. This intervention is essential for professional and high-quality auditory production.

6. The Development of a Complementary Solution

6.1. Limitations of Generative Samples

Generatively created samples play a valuable role in soundscape creation despite their constraints, particularly in terms of sample length. The maximum length is influenced by factors such as computational complexity, temporal coherence, API-imposed constraints (e.g., 20 s for the Eleven Labs model), and query costs. These restrictions make it challenging to generate long, cohesive soundscapes in a single pass, leading to noticeable repetitions in extended audio environments.

6.2. The Granular Synthesizer as a Solution

To address these limitations, a granular synthesizer prototype was developed. The synthesizer processes short generative samples, enabling their looping and recombination over time. Using a nondeterministic approach, it rearranges samples based on user-defined parameters, producing varied soundscapes that avoid predictable patterns and repetition.

6.3. Prototype Design and Implementation

The developed prototype, shown in Figure 3, is a granular synthesizer capable of generating dynamic and random soundscapes from a folder of attached audio samples. The full implementation is publicly available on the Open Science Framework platform (Open Science Framework 2024b) for further exploration and usage.
The system supports output through single or multiple channels, including stereo (left-right) and more complex multichannel configurations, such as 5.1 surround sound, with each channel independently populated by distinct sound grains. The synthesizer extracts grains from the stored audio files and reassembles them into nondeterministic combinations. Each grain is randomly selected and processed according to user-defined parameters to create a dynamic and immersive auditory experience.

6.3.1. User-Adjustable Parameters

The granular synthesizer features a Graphical User Interface (GUI) that allows users to control key parameters influencing the generation, combination, and presentation of sound grains (a minimal scheduling sketch follows the parameter list):
  • Grain Size defines the duration of each grain extracted from the audio files.
  • Overlap determines the degree of overlap between consecutive grains.
  • Fade Duration sets the fade-in and fade-out times for each grain, ensuring smooth transitions.
  • Simultaneous Grains specifies the number of grains played simultaneously.
  • Duration Deviation introduces variability in the duration of each grain to prevent repetitive patterns.
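A minimal, single-channel sketch of how these parameters could drive grain scheduling is given below, using only numpy and the standard wave module and assuming 16-bit WAV input; the published prototype on OSF differs in structure and detail.

```python
import glob
import random
import wave
import numpy as np

def load_mono(path: str) -> np.ndarray:
    """Read a 16-bit WAV file and return a mono float32 signal in [-1, 1]."""
    with wave.open(path, "rb") as w:
        data = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        data = data.astype(np.float32) / 32768.0
        if w.getnchannels() == 2:
            data = data.reshape(-1, 2).mean(axis=1)  # simple stereo-to-mono downmix
    return data

def render(files, seconds=30.0, sr=44100, grain_size=1.0, overlap=0.5,
           fade=0.2, simultaneous_grains=3, duration_deviation=0.3):
    """Fill an output buffer with randomly chosen, faded, overlapping grains."""
    out = np.zeros(int(seconds * sr), dtype=np.float32)
    for _ in range(simultaneous_grains):                  # independent grain streams
        pos = 0
        while pos < len(out):
            dur = max(0.1, grain_size + random.uniform(-duration_deviation, duration_deviation))
            src = load_mono(random.choice(files))
            n = int(dur * sr)
            start = random.randint(0, max(0, len(src) - n))
            grain = src[start:start + n].copy()
            f = min(int(fade * sr), len(grain) // 2)      # fade-in/out envelope length
            if f > 0:
                env = np.ones(len(grain), dtype=np.float32)
                env[:f] = np.linspace(0.0, 1.0, f, dtype=np.float32)
                env[-f:] = np.linspace(1.0, 0.0, f, dtype=np.float32)
                grain *= env
            end = min(pos + len(grain), len(out))
            out[pos:end] += grain[:end - pos]
            pos += max(1, int(len(grain) * (1.0 - overlap)))  # advance by the non-overlapping part
    return np.clip(out / simultaneous_grains, -1.0, 1.0)

mix = render(glob.glob("forest_pack/*.wav"))
```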

6.3.2. Support for Multiple Instances

The prototype synthesizer can run in multiple instances, allowing users to categorize sounds into distinct types, such as atmospheric/ambient layers and one-shot effects. This categorization enables users to fine-tune the presentation and interaction of sound types, thereby creating a tailored and immersive experience.
Suggested uses of instances (illustrative presets follow the list):
  • Atmospheric/ambient layers: Parameters can be adjusted to favor longer grain sizes, increased overlap, and softer fade durations to create smooth, continuous soundscapes (Schafer’s keynotes).
  • One-shot effects: Parameters can be configured for shorter grain sizes, minimal overlap, and precise fade settings, focusing on the clarity and impact of isolated sound elements (Schafer’s signals).
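Hedged example presets for the two instance types might look as follows; the numeric values are illustrative and not taken from the prototype.

```python
# Illustrative presets for two synthesizer instances; values are assumptions.
INSTANCE_PRESETS = {
    "atmospheric": {   # Schafer's keynotes: long, heavily overlapped, soft-edged grains
        "grain_size": 4.0, "overlap": 0.6, "fade_duration": 1.0,
        "simultaneous_grains": 4, "duration_deviation": 1.0,
    },
    "one_shot": {      # Schafer's signals: short, isolated, sharply defined grains
        "grain_size": 1.0, "overlap": 0.05, "fade_duration": 0.05,
        "simultaneous_grains": 1, "duration_deviation": 0.3,
    },
}
```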

7. Conclusions

7.1. Overview

This study investigates the application of generative transformers for creating multilayered soundscapes, uncovering both their potential and limitations. The proposed pipeline successfully demonstrates the feasibility of combining Generative Pre-trained Transformers to produce rich auditory environments using textual and visual prompts for sound design. Testing revealed the system’s ability to generate diverse soundscapes, including ambient layers and distinct point sounds, providing a powerful tool for sound designers seeking innovative approaches to sound creation.

7.2. Challenges

Unintended artifacts, issues in prompt interpretation, and other challenges underscore the need for refined control mechanisms. These limitations highlight the importance of implementing features such as exclusion directives and parameter tuning, as well as the necessity of human verification and refinement to ensure quality and accuracy. Addressing these challenges would improve the precision of the outputs and their suitability for professional applications.

7.3. Complementary Solution

The development of a granular synthesizer prototype demonstrates a viable solution for extending the usability of generative samples by enabling dynamic and non-repetitive recombination. This tool, with its adjustable parameters and multichannel capabilities, provides sound designers with greater flexibility in creating immersive and adaptable soundscapes tailored to specific contexts.

7.4. Future Research

Future work should focus on enhancing the reliability and coherence of generative outputs, further refining prompt-engineering strategies, and exploring the integration of exclusion parameters. Additionally, expanding the capabilities of granular synthesis to incorporate real-time feedback and adaptive layering could further enhance its utility in both artistic and commercial sound-design contexts. By addressing these areas, generative transformer-based soundscapes could achieve broader adoption and significantly impact audio production in the future.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the results of this study are openly available on the Open Science Framework (OSF). All relevant datasets and code used in the implementation can be accessed via the following repositories: the Generative Soundscape Pipeline (Open Science Framework 2024a) and the Granular Audio Player (Open Science Framework 2024b).

Acknowledgments

The author extends heartfelt thanks to Krzysztof Cybulski and Adam Mart for their inspiration and guidance, which greatly influenced the direction of this research. This work was made possible in the supportive environment of the Feliks Nowowiejski Academy of Music in Bydgoszcz, an institution that fosters innovative and exploratory research, with special thanks to Elżbieta Wtorkowska and Aleksandra Kłaput-Wiśniewska, for their leadership and encouragement. Special appreciation goes to Barbara Okoń-Makowska and Katarzyna Figat for their encouragement to explore broader perspectives on acoustics. The author also extends thanks to Jacek Lesiński for his intricate insights into the language structure of the paper.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Agostinelli, Andrea, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, and et al. 2023. MusicLM: Generating Music from Text. arXiv arXiv:2301.11325. [Google Scholar]
  2. Ashvala, Vinay, and Alexander Lerch. 2022. Evaluating Generative Audio Systems and Their Metrics. Paper presented at International Society for Music Information Retrieval Conference, Bengaluru, India, December 4–8. [Google Scholar]
  3. Barnett, Julie. 2023. The Ethical Implications of Generative Audio Models: A Systematic Literature Review. Paper presented at AAAI/ACM Conference on AI, Ethics, and Society, Montreal, QC, Canada, August 8–10. [Google Scholar]
  4. Birchfield, David, Nahla Mattar, and Hari Sundaram. 2005. Design of a Generative Model for Soundscape Creation. Paper presented at International Conference on Mathematics and Computing, January; Available online: https://www.researchgate.net/profile/Nahla-Mattar/publication/252768176_DESIGN_OF_A_GENERATIVE_MODEL_FOR_SOUNDSCAPE_CREATION/links/5d958e8392851c2f70e55d17/DESIGN-OF-A-GENERATIVE-MODEL-FOR-SOUNDSCAPE-CREATION.pdf (accessed on 30 November 2024).
  5. Blesser, Barry, and Linda-Ruth Salter. 2007. Spaces Speak, Are You Listening? Experiencing Aural Architecture. Cambridge, MA: MIT Press. [Google Scholar]
  6. Bubeck, Sébastien, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, and et al. 2023. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. arXiv arXiv:2303.12712. [Google Scholar]
  7. Chen, Sanyuan, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, and et al. 2021. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE Journal on Selected Topics in Signal Processing 16: 1505–18. [Google Scholar] [CrossRef]
  8. Chen, Zhehuai, Yu Zhang, Andrew E. Rosenberg, Bhuvana Ramabhadran, Pedro R. Moreno, Ankur Bapna, and Heiga Zen. 2022. MAESTRO: Matched Speech Text Representations through Modality Matching. arXiv arXiv:2204.03409. [Google Scholar]
  9. Copet, Jade, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. 2023. Simple and Controllable Music Generation. arXiv arXiv:2306.05284. [Google Scholar]
  10. Edwards, Justin, Allison Perrone, and Philip R. Doyle. 2020. Transparency in Language Generation: Levels of Automation. Paper presented at 2nd Conference on Conversational User Interfaces, Bilbao, Spain, June 22–24. [Google Scholar]
  11. Eigenfeldt, Arne, and Philippe Pasquier. 2011. Negotiated Content: Generative Soundscape Composition by Autonomous Musical Agents in Coming Together: Freesound. Paper presented at International Conference on Innovative Computing and Cloud Computing, Mexico City, Mexico, April 27–29; pp. 27–32. Available online: https://computationalcreativity.net/iccc2011/proceedings/the_social/eigenfeldt_iccc11.pdf (accessed on 30 November 2024).
  12. Eleven Labs. 2024a. Examples for Sound Effects. GitHub. Available online: https://github.com/elevenlabs/elevenlabs-examples/tree/main/examples/sound-effects/video-to-sfx (accessed on 30 November 2024).
  13. Eleven Labs. 2024b. Sound Effects. Eleven Labs. Available online: https://elevenlabs.io/sound-effects (accessed on 30 November 2024).
  14. Evans, Zach, Julian D. Parker, C. J. Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. 2024. Stable Audio Open. arXiv arXiv:2407.14358. [Google Scholar]
  15. Griffiths, Paul. 2010. Modern Music and After, 3rd ed. New York: Oxford University Press. [Google Scholar]
  16. Holmes, Thom. 2020. Electronic and Experimental Music: Technology, Music, and Culture. New York: Routledge. Available online: http://archive.org/details/electronicexperi0000holm_3rded (accessed on 30 November 2024).
  17. Huang, Qingqing, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, and et al. 2023a. Noise2Music: Text-Conditioned Music Generation with Diffusion Models. arXiv arXiv:2302.03917. [Google Scholar]
  18. Huang, Rongjie, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, and et al. 2023b. AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. arXiv arXiv:2304.12995. [Google Scholar]
  19. Ji, Shulei, Jing Luo, and Xinyu Yang. 2020. A Comprehensive Survey on Deep Music Generation: Multi-Level Representations, Algorithms, Evaluations, and Future Directions. arXiv arXiv:2011.06801. [Google Scholar]
  20. Koutini, Khaled, Jan Schlüter, Hamid Eghbalzadeh, and Gerhard Widmer. 2021. Efficient Training of Audio Transformers with Patchout. arXiv arXiv:2110.05069. [Google Scholar] [CrossRef]
  21. Koutsomichalis, Marinos, and Andrea Valle. 2014. SoundScapeGenerator: Soundscape Modelling and Simulation. Paper presented at XX CIM, Rome, Italy, October 20–22; pp. 65–70. Available online: https://iris.unito.it/retrieve/handle/2318/152887/27120/soundscapeModelling.pdf (accessed on 30 November 2024).
  22. Kreuk, Felix, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. 2023. AudioGen: Textually Guided Audio Generation. arXiv arXiv:2209.15352. [Google Scholar]
  23. Le, Matthew, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sarı, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and et al. 2023. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. Neural Information Processing Systems 36: 14005–34. [Google Scholar] [CrossRef]
  24. Li, Yizhi, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, and et al. 2024. MERT: Acoustic Music Understanding Model with Large-Scale Self-Supervised Training. arXiv arXiv:2306.00107. [Google Scholar]
  25. Liang, Jinhua, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, and Emmanouil Benetos. 2024. WavCraft: Audio Editing and Generation with Large Language Models. arXiv arXiv:2403.09527. [Google Scholar]
  26. Liu, Haohe, Zehua Chen, Zhen Yuan, Xinhao Mei, Xubo Liu, Danilo P. Mandic, Wenwu Wang, and Mark D. Plumbley. 2023a. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. arXiv arXiv:2301.12503. [Google Scholar] [CrossRef]
  27. Liu, Xubo, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, and et al. 2023b. WavJourney: Compositional Audio Creation with Large Language Models. arXiv arXiv:2307.14335. [Google Scholar] [CrossRef]
  28. Marrinan, Thomas, Pakeeza Akram, Oli Gurmessa, and Anthony Shishkin. 2024. Leveraging AI to Generate Audio for User-Generated Content in Video Games. arXiv arXiv:2404.17018. [Google Scholar]
  29. Midjourney. 2024. Midjourney Documentation: No Parameter. Midjourney. Available online: https://docs.midjourney.com/docs/no (accessed on 30 November 2024).
  30. Oh, Sangshin, Minsung Kang, Hyeongi Moon, Keunwoo Choi, and Ben Sangbae Chon. 2023. A Demand-Driven Perspective on Generative Audio AI. arXiv arXiv:2307.04292. [Google Scholar]
  31. OpenAI. 2024. Models Documentation. OpenAI. Available online: https://platform.openai.com/docs/models (accessed on 30 November 2024).
  32. Open Science Framework. 2024a. Generative Soundscape Pipeline. Available online: https://osf.io/qjm4a/?view_only=9ffac3ce06994e38bad5a81f3f16d82c (accessed on 7 December 2024).
  33. Open Science Framework. 2024b. Granular Audio Player. Available online: https://osf.io/yf5qd/?view_only=60029d17660b42a5923408c883835a8e (accessed on 7 December 2024).
  34. Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and et al. 2022. Training Language Models to Follow Instructions with Human Feedback. Neural Information Processing Systems 35: 27730–44. [Google Scholar] [CrossRef]
  35. Pinch, Trevor J., and Frank Trocco. 2009. Analog Days: The Invention and Impact of the Moog Synthesizer. Cambridge, MA: Harvard University Press. [Google Scholar]
  36. Porter, Brian, and Edouard Machery. 2024. AI-Generated Poetry Is Indistinguishable from Human-Written Poetry and Is Rated More Favorably. Scientific Reports 14: 26133. [Google Scholar] [CrossRef]
  37. Qingqing, Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, and Daniel P. W. Ellis. 2022. MuLan: A Joint Embedding of Music Audio and Natural Language. Paper presented at International Society for Music Information Retrieval Conference, Bengaluru, India, December 4–8. [Google Scholar]
  38. Ray, Partha Pratim. 2023. ChatGPT: A Comprehensive Review on Background, Applications, Key Challenges, Bias, Ethics, Limitations and Future Scope. Internet of Things and Cyber-Physical Systems 3: 121–54. [Google Scholar] [CrossRef]
  39. Rouard, Simon, Yossi Adi, Jade Copet, Axel Roebel, and Alexandre Défossez. 2024. Audio Conditioning for Music Generation via Discrete Bottleneck Features. arXiv arXiv:2407.12563. [Google Scholar]
  40. Rubenstein, Paul K., Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, and et al. 2023. AudioPaLM: A Large Language Model that Can Speak and Listen. arXiv arXiv:2306.12925. [Google Scholar]
  41. Samson, Grzegorz. 2024. Procedurally Generated AI Compound Media for Expanding Audial Creations, Broadening Immersion and Perception Experience. International Journal of Electronics and Telecommunications 70: 341–48. [Google Scholar] [CrossRef]
  42. Schafer, R. Murray. 1993. The Soundscape: Our Sonic Environment and the Tuning of the World. Rochester: Inner Traditions/Bear. Available online: https://books.google.pl/books?id=_N56QgAACAAJ (accessed on 30 November 2024).
  43. Stamać, Ivan. 2005. Acoustical and Musical Solution to Wave-Driven Sea Organ in Zadar. Paper presented at 2nd Congress of Alps-Adria Acoustics Association and 1st Congress of Acoustical Society of Croatia, Opatija, Croatia, June 23–24; pp. 203–6. [Google Scholar]
  44. Taruskin, Richard, and Christopher H. Gibbs. 2013. The Oxford History of Western Music, College ed. New York: Oxford University Press. [Google Scholar]
  45. Toop, David. 1995. Ocean of Sound: Aether Talk, Ambient Sound and Imaginary Worlds. London: Serpent’s Tail. [Google Scholar]
  46. Tur, Ada. 2024. Deep Learning for Style Transfer and Experimentation with Audio Effects and Music Creation. Paper presented at AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, February 20–27. [Google Scholar]
  47. van den Oord, Aäron, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. arXiv arXiv:1609.03499. [Google Scholar]
  48. Verma, Prateek, and Chris Chafe. 2021. A Generative Model for Raw Audio Using Transformer Architectures. arXiv arXiv:2106.16036. [Google Scholar]
  49. Verma, Prateek, and Jonathan Berger. 2021. Audio Transformers: Transformer Architectures for Large Scale Audio Understanding. Adieu Convolutions. arXiv arXiv:2105.00335. [Google Scholar]
  50. Vyas, Apoorv, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, and et al. 2023. Audiobox: Unified Audio Generation with Natural Language Prompts. arXiv arXiv:2312.15821. [Google Scholar]
  51. Wayne, Kevin. 2023. Mozart Musical Dice Game. Paper presented at 54th SIGCSE Technical Symposium on Computer Science Education, Toronto, ON, Canada, March 15–18. [Google Scholar]
  52. Xenakis, Iannis, and Sharon Kanach. 1992. Formalized Music: Thought and Mathematics in Composition. Hillsdale: Pendragon Press. Available online: https://books.google.pl/books?id=fDAJAQAAMAAJ (accessed on 30 November 2024).
  53. Yu, Jiaxing, Songruoyao Wu, Guanting Lu, Zijin Li, Li Zhou, and Kejun Zhang. 2024. Suno: Potential, Prospects, and Trends. Frontiers of Information Technology & Electronic Engineering 25: 1025–30. [Google Scholar] [CrossRef]
  54. Zhang, Chenshuang, Chaoning Zhang, Shusen Zheng, Mengchun Zhang, Maryam Qamar, Sung-Ho Bae, and In So Kweon. 2023. A Survey on Audio Diffusion Models: Text to Speech Synthesis and Enhancement in Generative AI. arXiv arXiv:2303.13336. [Google Scholar]
  55. Zhuang, Yonggai, Yuhao Kang, Teng Fei, Meng Bian, and Yunyan Du. 2024. From Hearing to Seeing: Linking Auditory and Visual Place Perceptions with Soundscape-to-Image Generative Artificial Intelligence. Computers, Environment and Urban Systems 110: 102122. [Google Scholar] [CrossRef]
Figure 1. A generative soundscape flowchart with an added granular synthesizer as a complementary solution (source: own research).
Figure 2. Waveform visualization of a generative forest soundscape pack. Each waveform represents a distinct 10-s sound layer corresponding to a specific conceptual element within the soundscape. All sound files were normalized for graphical representation (source: own research).
Figure 3. Granular audio player GUI (source: own research).
