Emerging Trends in Generative-AI Based Audio Processing

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: 31 December 2025

Special Issue Editor


Dr. Rita Singh
Guest Editor
Center for Voice Intelligence and Security, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Interests: automated discovery, measurement, representation, and learning of the information encoded in the voice signal for optimal voice intelligence

Special Issue Information

Dear Colleagues,

Generative AI continues to transform audio processing through rapid advances in deep learning, signal modeling, and interdisciplinary applications. Beyond traditional domains like speech synthesis and music generation, research has expanded to include deepfake detection, audio watermarking, audio for embodied AI, and multimodal systems that integrate sound with text, vision, and haptic feedback. Meanwhile, emerging large-scale foundation models and novel architectures such as diffusion-based generators and transformer-based pipelines have presented new opportunities to synthesize and analyze audio in previously unattainable ways.

This Special Issue aims to provide a comprehensive and forward-looking collection of innovative methodologies, theoretical advances, and real-world applications in generative AI for audio. We invite contributions spanning foundational research, rigorous evaluations, proof-of-concept prototypes, and production-level deployments to foster collaborative and multidisciplinary exchange among industry practitioners, academic researchers, and policymakers.

(1) Introduction

Generative AI has rapidly evolved from a niche research topic into a major driver of innovation across speech technologies, music production, sound design, and beyond. Earlier deep generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) have paved the way for more sophisticated approaches, including diffusion models and the large transformer-based architectures that underpin audio foundation models, enabling unprecedented fidelity in synthesized audio. These generative systems can also summarize, translate, and contextualize audio content.

Such advances introduce new opportunities and challenges. Applications like audio watermarking and deepfake detection address urgent concerns surrounding the authenticity and ethical use of generative content. Meanwhile, embodied AI leverages audio for human–robot interaction, facilitating more naturalistic communication in assistive devices, virtual environments, and physical robotics. As generative models become more adept at zero-shot or few-shot learning, they open the door to cross-lingual and cross-modal scenarios, transforming how audio is created, consumed, and regulated.

The entertainment, accessibility, healthcare, education, and robotics industries increasingly rely on robust and intelligent audio systems. This growing dependence underscores the need for computational efficiency, scalability, and responsible AI guidelines. By uniting machine learning, signal processing, music information retrieval, cognitive science, and law and ethics, this Special Issue aims to illuminate the state of the art and future directions of generative-AI-based audio research.

(2) Aim of the Special Issue

  1. Document Emerging Frontiers: Capture the latest developments in foundation models, multimodal architectures, and diffusion-based methods, highlighting their capabilities and limitations in audio.
  2. Promote Responsible Innovation: Facilitate research on deepfake detection, audio watermarking, and content authentication to ensure ethical practices and trust in generative audio.
  3. Foster Interdisciplinary Collaboration: Bring together experts from signal processing, robotics, music technology, law, and healthcare to tackle real-world challenges, from embodied AI to therapeutic applications.
  4. Encourage Novel Evaluation Protocols: Showcase standardized metrics, benchmarking datasets, and best practices for rigorously evaluating generative audio models, including subjective and objective measures.
  5. Identify Future Directions: Discuss emerging research questions around scalability, real-time deployment, privacy, and regulatory frameworks to shape upcoming advancements in generative AI-based audio.

By providing a unified platform, this Special Issue aims to serve as a reference point for academic inquiry and industrial deployment, charting a course for responsible and impactful progress in generative-AI-based audio processing.

(3) Suggested Themes and Article Types for Submissions

We welcome original research articles, review papers, perspective pieces, and system demonstrations on the following topics (among others):

  1. Advanced Generative Architectures for Audio
    • Diffusion models, transformers, and hybrid approaches (GANs, VAEs, and diffusion).
    • Foundation models for audio that enable zero-shot or few-shot generation.
  2. Speech Synthesis and Voice Modeling
    • High-fidelity text-to-speech (TTS) with prosody control, emotional expression, and multilingual support.
    • Voice cloning and style transfer techniques with attention to authenticity and ethics.
  3. Music Generation and Composition
    • Generative music models with real-time collaboration, interactive composition tools, and cross-genre adaptation.
    • Creative AI frameworks for remixing, personalization, and user-driven musical experiences.
  4. Audio Enhancement and Restoration
    • Noise reduction, dereverberation, super-resolution, and audio inpainting.
    • Generative methods for historical/archival audio restoration and content preservation.
  5. Audio Watermarking and Authentication
    • Novel watermarking strategies that are robust against generative manipulation.
    • Protection of intellectual property and forensic analysis of AI-synthesized audio.
  6. Deepfake Detection and Content Verification
    • Algorithms for detecting synthesized or manipulated speech and music.
    • Robust authentication systems and frameworks for identifying generative artifacts.
  7. Audio Summarization and Reasoning
    • Techniques for condensing long-form audio, such as meetings, lectures, or podcasts.
    • Exploiting natural language processing (NLP) for cross-modal summarization (audio-to-text).
    • Integrating reasoning capabilities to enable context-aware audio interactions.
  8. Embodied AI and Human–Robot Interaction
    • Speech-driven robot control, sound-based localization, and audio-based sensor fusion in robotics.
    • Generative audio for virtual reality (VR), augmented reality (AR), and immersive simulations.
  9. Multimodal and Cross-Modal Generative Models
    • Text-to-audio, image-to-sound, or sensor-to-audio pipelines for creative content generation.
    • Combining linguistic, visual, and auditory cues for contextual and situational understanding.
  10. Ethical, Societal, and Regulatory Concerns
    • Responsible AI guidelines, ethical design, and privacy-preserving approaches.
    • Legal frameworks for intellectual property, user consent, and content licensing in generative audio.
  11. Computational Efficiency, Deployment, and Scaling
    • Resource-constrained model deployment (edge computing and mobile devices).
    • Distributed training, federated learning, and cloud-based solutions for large-scale audio.
  12. Evaluation Metrics, Benchmarks, and Standardization
    • Objective and subjective evaluation methodologies for generative audio (e.g., MOS and preference tests).
    • Open-source datasets, community challenges, and reproducibility initiatives.
  13. Applications in Accessibility, Healthcare, and Education
    • Assistive technologies for users who are blind or hard of hearing (text-to-speech and speech-to-text).
    • Therapeutic and diagnostic applications employing generative audio (e.g., speech therapy).
    • Interactive learning environments leveraging real-time generative feedback.

We encourage researchers, practitioners, and industry experts to contribute innovative, rigorous, and impactful work. By bringing together a broad spectrum of approaches and perspectives, this Special Issue will push the boundaries of current generative-AI-based audio research and lay the groundwork for future advances. For more information about submission guidelines and timelines, please refer to the journal’s website or contact the guest editor directly.

We look forward to your innovative contributions and fostering vibrant discussions on the future of generative AI in audio processing.

Dr. Rita Singh
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • generative AI
  • audio watermarking
  • deepfake detection
  • speech processing
  • music generation
  • audio foundation models
  • neural audio reasoning
  • embodied AI
  • audio summarization
  • multimodal integration

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad-scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (2 papers)


Research

20 pages, 2171 KiB  
Article
CBAM-ResNet: A Lightweight ResNet Network Focusing on Time Domain Features for End-to-End Deepfake Speech Detection
by Yuezhou Wu, Hua Huang, Zhiri Li and Siling Zhang
Electronics 2025, 14(12), 2456; https://doi.org/10.3390/electronics14122456 - 17 Jun 2025
Abstract
With the rapid development of synthetic speech and deepfake technology, fake speech poses a severe challenge to voice authentication systems. Traditional detection methods generally rely on manual feature extraction, facing problems such as limited feature expression ability and insufficient cross-scenario generalization performance. To this end, this paper proposes an improved ResNet network based on a Convolutional Block Attention Module (CBAM) for end-to-end fake speech detection. This method introduces channel attention and spatial attention mechanisms into the ResNet network structure to enhance the model’s attention to the temporal characteristics of speech, thereby improving the ability to distinguish between real and fake speech. The proposed model adopts an end-to-end training strategy, directly processes the original spectrogram input, uses the residual structure to alleviate the gradient vanishing problem in the deep network, and enhances the collaborative expression ability of local details and global context through the CBAM module. The experiment is conducted on the ASVspoof2019 LA dataset, and the equal error rate (EER) is used as the main evaluation indicator. The experimental results show that compared with traditional deepfake speech detection methods, the proposed model achieves better performance in indicators such as EER, verifying the effectiveness of the CBAM attention mechanism in forged speech detection.
(This article belongs to the Special Issue Emerging Trends in Generative-AI Based Audio Processing)
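For readers unfamiliar with the attention mechanism referenced above, the following is a minimal sketch of a generic CBAM block in PyTorch (channel attention followed by spatial attention, after Woo et al., 2018). It is illustrative only and is not the authors' implementation; the class names, tensor shapes, and the point of integration into a ResNet are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))            # (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))             # (B, C)
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pool across the channel axis, then learn a 2-D (frequency x time) attention map.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied to a feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))

# Example: refine a residual block's output on a (batch, channels, freq, time) spectrogram tensor.
feat = torch.randn(4, 64, 80, 200)
refined = CBAM(64)(feat)
```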
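Since the equal error rate (EER) is the headline metric in this and many other anti-spoofing studies, a brief sketch of how it can be computed from raw detection scores may also be useful. This is a generic threshold-sweep implementation rather than the evaluation code used in the paper; the variable names and toy data are illustrative.

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return the EER given per-utterance scores (higher = more likely genuine)
    and binary labels (1 = genuine, 0 = spoofed)."""
    order = np.argsort(-scores)                   # sort utterances by descending score
    labels = labels[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Cumulative error counts as the decision threshold sweeps over the sorted scores.
    false_accepts = np.cumsum(1 - labels)         # spoofed utterances accepted so far
    false_rejects = n_pos - np.cumsum(labels)     # genuine utterances rejected beyond this point
    far = false_accepts / n_neg
    frr = false_rejects / n_pos
    idx = np.argmin(np.abs(far - frr))            # operating point where FAR ~= FRR
    return float((far[idx] + frr[idx]) / 2)

scores = np.array([0.9, 0.8, 0.75, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(compute_eer(scores, labels))                # ~0.333 on this toy example
```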

17 pages, 1071 KiB  
Article
Empirical Analysis of Learning Improvements in Personal Voice Activity Detection Frameworks
by Yu-Tseng Yeh, Chia-Chi Chang and Jeih-Weih Hung
Electronics 2025, 14(12), 2372; https://doi.org/10.3390/electronics14122372 - 10 Jun 2025
Abstract
Personal Voice Activity Detection (PVAD) has emerged as a critical technology for enabling speaker-specific detection in multi-speaker environments, surpassing the limitations of conventional Voice Activity Detection (VAD) systems that merely distinguish speech from non-speech. PVAD systems are essential for applications such as personalized voice assistants and robust speech recognition, where accurately identifying a target speaker’s voice amidst background speech and noise is crucial for both user experience and computational efficiency. Despite significant progress, PVAD frameworks still face challenges related to temporal modeling, integration of speaker information, class imbalance, and deployment on resource-constrained devices. In this study, we present a systematic enhancement of the PVAD framework through four key innovations: (1) a Bi-GRU (Bidirectional Gated Recurrent Unit) layer for improved temporal modeling of speech dynamics, (2) a cross-attention mechanism for context-aware speaker embedding integration, (3) a hybrid CE-AUROC (Cross-Entropy and Area Under Receiver Operating Characteristic) loss function to address class imbalance, and (4) Cosine Annealing Learning Rate (CALR) for optimized training convergence. Evaluated on LibriSpeech datasets under varied acoustic conditions, the proposed modifications demonstrate significant performance gains over the baseline PVAD framework, achieving 87.59% accuracy (vs. 86.18%) and 0.9481 mean Average Precision (vs. 0.9378) while maintaining real-time processing capabilities. These advancements address critical challenges in PVAD deployment, including robustness to noisy environments, with the hybrid loss function reducing false negatives by 12% in imbalanced scenarios. The work provides practical insights for implementing personalized voice interfaces on resource-constrained devices. Future extensions will explore quantized inference and multi-modal sensor fusion to further bridge the gap between laboratory performance and real-world deployment requirements.
(This article belongs to the Special Issue Emerging Trends in Generative-AI Based Audio Processing)
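To make the four enhancements named in the abstract more concrete, here is a hedged, minimal sketch in PyTorch of how a Bi-GRU front end, cross-attention over a target-speaker embedding, a hybrid cross-entropy plus pairwise AUROC-surrogate loss, and a cosine-annealed learning rate could fit together. It is not the authors' code: the module names, dimensions, class layout, and the particular AUROC surrogate are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PVADSketch(nn.Module):
    """Frame-level personal VAD sketch: acoustic frames attend to a target-speaker embedding."""

    def __init__(self, feat_dim=40, hidden=128, spk_dim=256, n_classes=3):
        super().__init__()
        # (1) Bi-GRU for temporal modeling of the frame sequence.
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.spk_proj = nn.Linear(spk_dim, 2 * hidden)
        # (2) Cross-attention: frames are queries, the speaker embedding is key/value.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        # Frame classes, e.g. {non-speech, non-target speech, target speech}.
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames, spk_emb):
        h, _ = self.gru(frames)                        # (B, T, 2H)
        s = self.spk_proj(spk_emb).unsqueeze(1)        # (B, 1, 2H)
        ctx, _ = self.attn(query=h, key=s, value=s)    # speaker-conditioned frame features
        return self.head(h + ctx)                      # (B, T, n_classes) logits


def hybrid_ce_auroc_loss(logits, labels, target_class=2, alpha=0.5):
    """(3) Cross-entropy plus a soft pairwise AUROC surrogate for the target-speaker class."""
    flat_logits = logits.reshape(-1, logits.size(-1))
    flat_labels = labels.reshape(-1)
    ce = F.cross_entropy(flat_logits, flat_labels)
    scores = flat_logits[:, target_class]
    pos, neg = scores[flat_labels == target_class], scores[flat_labels != target_class]
    if pos.numel() == 0 or neg.numel() == 0:
        return ce
    # Encourage every target-frame score to exceed every non-target frame score.
    auc_surrogate = torch.sigmoid(neg.unsqueeze(0) - pos.unsqueeze(1)).mean()
    return alpha * ce + (1 - alpha) * auc_surrogate


model = PVADSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# (4) Cosine Annealing Learning Rate schedule over the training run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-5)

frames, spk_emb = torch.randn(2, 300, 40), torch.randn(2, 256)   # toy batch: 2 utterances, 300 frames
labels = torch.randint(0, 3, (2, 300))
for epoch in range(50):
    optimizer.zero_grad()
    loss = hybrid_ce_auroc_loss(model(frames, spk_emb), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
```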
