Diffusion-Based Model for Audio Steganography
Abstract
1. Introduction
- This paper proposes a novel steganographic framework based on a diffusion probabilistic model. Compared with traditional autoencoder- and GAN-based methods, the framework models the complex distribution of audio signals more effectively and embeds information more naturally.
- This paper proposes a diffusion steganography strategy whose diffusion mechanism is tailored to the time–frequency characteristics of audio signals. By embedding and extracting information through a multi-stage diffusion process, the approach addresses the trade-off between capacity and robustness found in traditional methods.
2. Related Work
2.1. Traditional Steganographic Approaches
2.2. GAN-Based Steganography Approaches
3. Methodology
3.1. Overall Framework
3.2. Forward Diffusion
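As a point of reference only, the standard forward process of a denoising diffusion probabilistic model (Ho et al., listed in the references) can be written as follows. The symbols $x_0$ (clean audio representation), $\beta_t$ (noise schedule), $T$ (number of steps), and $\epsilon$ (Gaussian noise) are the usual DDPM notation and are assumptions here; the formulation used in this paper may differ, for example in how the secret data enters the process.

```latex
% Standard DDPM forward (noising) process; notation assumed, not taken from this paper.
\begin{align}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right),
  \qquad t = 1, \dots, T, \\
\alpha_t &= 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \\
x_t &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
  \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
\end{align}
```

The last line is the closed form that lets $x_t$ be sampled directly from $x_0$ without iterating over all intermediate steps.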
3.3. Reverse Generation
3.4. Training Strategy
3.5. Secret Data Extraction
4. Experiments
4.1. Experimental Settings
4.2. Evaluation Criteria
- PESQ: PESQ scores range from −0.5 to 4.5, with higher values indicating better perceptual quality.
- SNR: The SNR is the ratio of the mean power of the original signal to the mean power of the noise; higher values indicate less distortion.
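The SNR criterion above can be illustrated with a minimal sketch. It assumes that the "noise" is the sample-wise difference between the cover audio and the stego audio and that the result is reported in decibels; the function name `snr_db` and the toy signals are hypothetical and not taken from the paper.

```python
import math


def snr_db(cover, stego):
    """Return the SNR in dB, treating (cover - stego) as the noise.

    Assumption: both inputs are equal-length sequences of float samples.
    """
    if len(cover) != len(stego):
        raise ValueError("cover and stego must have the same length")
    n = len(cover)
    # Mean power of the original (cover) signal.
    signal_power = sum(x * x for x in cover) / n
    # Mean power of the embedding distortion.
    noise_power = sum((x - y) ** 2 for x, y in zip(cover, stego)) / n
    if noise_power == 0:
        return math.inf  # identical signals: no distortion
    if signal_power == 0:
        return -math.inf  # silent cover with non-zero distortion
    return 10.0 * math.log10(signal_power / noise_power)


# Toy example: a distortion of amplitude 0.05 on a signal of amplitude 0.5.
cover = [0.5, -0.5, 0.5, -0.5]
stego = [0.55, -0.45, 0.45, -0.55]
print(round(snr_db(cover, stego), 2))
```

In the toy example the power ratio is 0.25 / 0.0025 = 100, which corresponds to about 20 dB.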
4.3. Comparison Methods
4.4. Robustness Evaluation
4.5. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Bender, W.; Gruhl, D.; Morimoto, N.; Lu, A. Techniques for data hiding. IBM Syst. J. 1996, 35, 313–336.
- Ghasemzadeh, H.; Kayvanrad, M.H. Comprehensive review of audio steganalysis methods. IET Signal Process. 2018, 12, 673–687.
- Wu, J.; Chen, B.; Luo, W.; Fang, Y. Audio steganography based on iterative adversarial attacks against convolutional neural networks. IEEE Trans. Inf. Forensics Secur. 2020, 15, 2282–2294.
- Chen, K. Digital watermarking and steganography. In Encyclopedia of Multimedia Technology and Networking, 2nd ed.; IGI Global: Palmdale, PA, USA, 2009; pp. 402–409.
- Chen, L.; Wang, R.; Dong, L.; Yan, D. Imperceptible adversarial audio steganography based on psychoacoustic model. Multimed. Tools Appl. 2023, 82, 26451–26463.
- Singh, L.; Singh, A.K.; Singh, P.K. Secure data hiding techniques: A survey. Multimed. Tools Appl. 2020, 79, 15901–15921.
- Zhang, Z.; Zeng, J.; Xu, Y.; Yi, X.; Cao, Y.; Liu, C. Triple-Stage Robust Audio Steganography Framework with AAC Encoding for Lossy Social Media Channels. In Proceedings of the ACM Workshop on Information Hiding and Multimedia Security, San Jose, CA, USA, 18–20 June 2025; pp. 131–141.
- Wang, J.; Wang, K. A novel audio steganography based on the segmentation of the foreground and background of audio. Comput. Electr. Eng. 2025, 123, 110026.
- Subramanian, N.; Cheheb, I.; Elharrouss, O.; Al-Maadeed, S.; Bouridane, A. End-to-end image steganography using deep convolutional autoencoders. IEEE Access 2021, 9, 135585–135593.
- Peng, J.; Liao, Y.; Tang, S. Audio steganalysis using multi-scale feature fusion-based attention neural network. IET Commun. 2025, 19, e12806.
- Yang, Z.L.; Zhang, S.Y.; Hu, Y.T.; Hu, Z.W.; Huang, Y.F. VAE-Stega: Linguistic steganography based on variational auto-encoder. IEEE Trans. Inf. Forensics Secur. 2020, 16, 880–895.
- Chen, L.; Wang, R.; Yan, D.; Wang, J. Learning to generate steganographic cover for audio steganography using GAN. IEEE Access 2021, 9, 88098–88107.
- Li, J.; Wang, K.; Jia, X. A coverless audio steganography based on generative adversarial networks. Electronics 2023, 12, 1253.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- Yang, D.; Yu, J.; Wang, H.; Wang, W.; Weng, C.; Zou, Y.; Yu, D. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2023, 31, 1720–1733.
- Huang, R.; Huang, J.; Yang, D.; Ren, Y.; Liu, L.; Li, M.; Ye, Z.; Liu, J.; Yin, X.; Zhao, Z. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 13916–13932.
- Gambhir, A.; Khara, S. Integrating RSA cryptography & audio steganography. In Proceedings of the 2016 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 29–30 April 2016; pp. 481–484.
- Mishra, A.; Johri, P.; Mishra, A. Audio steganography using ASCII code and GA. In Proceedings of the 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS), Dubai, United Arab Emirates, 18–20 December 2017; pp. 646–651.
- Nassrullah, H.A.; Flayyih, W.N.; Nasrullah, M.A. Enhancement of LSB Audio Steganography Based on Carrier and Message Characteristics. J. Inf. Hiding Multim. Signal Process. 2020, 11, 126–137.
- Oh, H.O.; Seok, J.W.; Hong, J.W.; Youn, D.H. New echo embedding technique for robust and imperceptible audio watermarking. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No. 01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; Volume 3, pp. 1341–1344.
- Erfani, Y.; Siahpoush, S. Robust audio watermarking using improved TS echo hiding. Digit. Signal Process. 2009, 19, 809–814.
- Ghasemzadeh, H.; Kayvanrad, M.H. Toward a robust and secure echo steganography method based on parameters hopping. In Proceedings of the 2015 Signal Processing and Intelligent Systems Conference (SPIS), Tehran, Iran, 16–17 December 2015; pp. 143–147.
- Fu, Z.; Wang, F.; Sun, X.M.; Wang, Y. Research on steganography of digital images based on deep learning. Chin. J. Comput. 2020, 43, 1656–1672.
- Li, S.; Xue, M.; Zhao, B.Z.H.; Zhu, H.; Zhang, X. Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE Trans. Dependable Secur. Comput. 2020, 18, 2088–2105.
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 933–941.
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199.
- Tang, W.; Li, B.; Tan, S.; Barni, M.; Huang, J. CNN-based adversarial embedding for image steganography. IEEE Trans. Inf. Forensics Secur. 2019, 14, 2074–2087.
- Lemercier, J.M.; Richter, J.; Welker, S.; Moliner, E.; Välimäki, V.; Gerkmann, T. Diffusion models for audio restoration: A review. IEEE Signal Process. Mag. 2025, 41, 72–84.
- Alexanderson, S.; Nagy, R.; Beskow, J.; Henter, G.E. Listen, denoise, action! audio-driven motion synthesis with diffusion models. ACM Trans. Graph. (TOG) 2023, 42, 1–20.
- Ghosal, D.; Majumder, N.; Mehrish, A.; Poria, S. Text-to-audio generation using instruction guided latent diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 20 August 2023; pp. 3590–3598.
- Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Tech. Rep. N 1993, 93, 27403.
- Minematsu, N.; Tomiyama, Y.; Yoshimoto, K.; Shimizu, K.; Nakagawa, S.; Dantsuji, M.; Makino, S. English Speech Database Read by Japanese Learners for CALL System Development. In Proceedings of the LREC, Las Palmas, Spain, 29–31 May 2002.
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No. 01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752.
- Sharp, T. An implementation of key-based digital signal steganography. In Proceedings of the International Workshop on Information Hiding, Pittsburgh, PA, USA, 25–27 April 2001; pp. 13–26.
- Filler, T.; Judas, J.; Fridrich, J. Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Trans. Inf. Forensics Secur. 2011, 6, 920–935.
- Yang, J.; Zheng, H.; Kang, X.; Shi, Y.Q. Approaching optimal embedding in audio steganography with GAN. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2827–2831.
- Lin, Y.; Wang, R.; Yan, D.; Dong, L.; Zhang, X. Audio steganalysis with improved convolutional neural network. In Proceedings of the ACM Workshop on Information Hiding and Multimedia Security, Paris, France, 3–5 July 2019; pp. 210–215.
- Chen, B.; Luo, W.; Li, H. Audio steganalysis with convolutional neural network. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, Philadelphia, PA, USA, 20–22 June 2017; pp. 85–90.

Dataset | Steganography Method | Embedding Rate 0.5 bps | 0.4 bps | 0.3 bps | 0.2 bps | 0.1 bps |
---|---|---|---|---|---|---|
TIMIT | LSBM [34] | 76.13 | 72.37 | 68.79 | 67.22 | 65.06 |
| | | 70.33 | 68.06 | 66.14 | 65.24 | 61.36 |
TIMIT | STC [35] | 68.34 | 64.86 | 61.35 | 58.75 | 51.32 |
| | | 66.28 | 62.39 | 59.64 | 55.71 | 50.44 |
TIMIT | Yang et al. [36] | 67.37 | 64.64 | 60.17 | 56.36 | 52.07 |
| | | 65.47 | 62.28 | 59.54 | 54.58 | 50.47 |
TIMIT | Chen et al. [12] | 65.18 | 62.37 | 58.79 | 54.43 | 50.04 |
| | | 62.37 | 59.62 | 55.48 | 52.27 | 49.16 |
TIMIT | VAE_Stega [11] | 65.04 | 61.87 | 56.25 | 53.08 | 50.04 |
| | | 62.16 | 58.73 | 55.85 | 52.07 | 49.15 |
TIMIT | Proposed method | 63.16 | 60.22 | 53.18 | 51.16 | 48.03 |
| | | 60.67 | 56.47 | 54.58 | 51.04 | 48.25 |
UME | LSBM [34] | 76.23 | 71.06 | 69.88 | 66.48 | 60.26 |
| | | 72.37 | 68.39 | 65.48 | 62.19 | 58.23 |
UME | STC [35] | 70.29 | 67.59 | 62.25 | 60.96 | 57.46 |
| | | 66.79 | 64.85 | 61.05 | 57.48 | 54.33 |
UME | Yang et al. [36] | 65.26 | 63.31 | 60.18 | 57.17 | 54.57 |
| | | 66.29 | 64.67 | 60.81 | 57.47 | 52.12 |
UME | Chen et al. [12] | 63.15 | 61.46 | 58.27 | 54.39 | 51.74 |
| | | 61.05 | 58.46 | 56.37 | 53.35 | 49.87 |
UME | VAE_Stega [11] | 62.76 | 60.95 | 56.22 | 52.81 | 50.18 |
| | | 60.76 | 57.48 | 55.85 | 51.56 | 48.59 |
UME | Proposed method | 62.05 | 59.64 | 54.26 | 51.74 | 48.22 |
| | | 60.28 | 55.72 | 53.13 | 50.48 | 47.12 |

Attack Type | Intensity | Proposed Method | LSBM | STC | Yang et al. [36] | Chen et al. [5] | VAE_Stega |
---|---|---|---|---|---|---|---|
Gaussian Noise | 4 dB | 4.6%/95.4% | 9.8%/90.2% | 8.7%/91.3% | 7.5%/92.5% | 6.1%/93.9% | 5.7%/94.3% |
Gaussian Noise | 8 dB | 5.8%/94.2% | 10.4%/89.6% | 9.2%/90.8% | 8.1%/91.9% | 7.6%/92.4% | 7.0%/93.0% |
Gaussian Noise | 16 dB | 6.3%/93.7% | 12.3%/87.7% | 10.5%/89.5% | 9.3%/90.7% | 8.5%/91.5% | 7.8%/92.2% |
Uniform Noise | 4 dB | 4.2%/95.8% | 9.1%/90.9% | 8.3%/91.7% | 7.2%/92.8% | 6.4%/93.6% | 5.3%/94.7% |
Uniform Noise | 8 dB | 5.5%/94.5% | 9.8%/90.2% | 9.6%/90.4% | 7.7%/92.3% | 7.2%/92.8% | 6.7%/93.3% |
Uniform Noise | 16 dB | 6.0%/93.0% | 10.6%/89.4% | 9.7%/90.3% | 8.8%/91.2% | 8.1%/91.9% | 7.4%/92.6% |

Index | Modified Variants |
---|---|
#1 | Proposed framework (complete architecture) |
#2 | Remove the normalization operation for the input data |
#3 | Remove the posterior constraint |
#4 | Remove skip connections in the main framework |
#5 | Remove ReLU activation in the proposed framework |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Xi, J.; Xia, Z.; Zhang, W.; Xie, Y.; Zhao, L. Diffusion-Based Model for Audio Steganography. Electronics 2025, 14, 4019. https://doi.org/10.3390/electronics14204019