This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Open Access Article
Spectral Multi-Representation Fusion for Audio Deepfake Detection
by Dora Ballesteros *, Daniel Suarez and Cesar Pachon
Facultad de Ingenieria, Universidad Militar Nueva Granada, Bogota 110111, Colombia
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(7), 549; https://doi.org/10.3390/a19070549 (registering DOI)
Submission received: 20 May 2026 / Revised: 20 June 2026 / Accepted: 1 July 2026 / Published: 5 July 2026
Abstract
Audio deepfake detection systems often achieve excellent internal validation performance but fail to generalize under real-world inference conditions involving synthetic speech generated with previously unseen AI tools. To address this limitation, this work proposes the Spectral Multi-Representation Fusion (SMRF) framework, which integrates multiple spectral representations and decision-level fusion strategies to improve robustness under cross-domain conditions. Additionally, a Stability-Aware Multi-Metric Selection (SAMMS) strategy is introduced to select architectures by jointly considering predictive performance and cross-representation stability. The proposed framework was evaluated using four spectral representations (log-magnitude spectrogram (LOG), Mel spectrogram (MEL), Discrete Wavelet Transform (DWT), and Constant-Q Transform (CQT)) combined with multiple convolutional architectures and complementary voting strategies. The experiments revealed that isolated models with validation metrics above 95% may still produce very poor synthetic-audio detection rates during external inference, in some cases below 10%. In contrast, fusion-based strategies substantially improved robustness by exploiting complementary synthetic evidence across spectral domains. The results also demonstrated that both the voting strategy and the SAMMS stability parameter strongly affect the final behavior of the system. In particular, hybrid fusion using One-Hard Voting with two architectures selected using SAMMS achieved the best balance between synthetic-audio detection and real-audio preservation, outperforming individual models under cross-domain inference conditions, with detection rates close to 75% for both synthetic and real audio. These findings suggest that stability-aware fusion strategies constitute a promising direction for improving robustness in realistic audio deepfake detection scenarios.
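As a rough illustration of decision-level fusion across spectral representations, the following Python sketch combines one hard decision per representation. The abstract does not define One-Hard Voting, so the rule used here (label a clip synthetic if at least one detector votes synthetic) is an assumption, as are the function names `one_hard_vote` and `majority_vote` and the 0/1 label encoding. It is a minimal sketch, not the authors' implementation.

```python
from typing import Dict

# Label encoding assumed for this sketch: 1 = synthetic, 0 = real.
SYNTHETIC = 1
REAL = 0


def one_hard_vote(votes: Dict[str, int]) -> int:
    """Assumed rule: synthetic if at least one detector votes synthetic."""
    return SYNTHETIC if any(v == SYNTHETIC for v in votes.values()) else REAL


def majority_vote(votes: Dict[str, int]) -> int:
    """Conventional hard majority vote, shown for comparison (ties -> real)."""
    n_synthetic = sum(1 for v in votes.values() if v == SYNTHETIC)
    return SYNTHETIC if 2 * n_synthetic > len(votes) else REAL


# Hypothetical hard decisions from four representation-specific detectors.
example_votes = {"LOG": REAL, "MEL": SYNTHETIC, "DWT": REAL, "CQT": REAL}

fused_one_hard = one_hard_vote(example_votes)
fused_majority = majority_vote(example_votes)
```

Under this assumed rule, a single representation that exposes synthetic evidence is enough to flag the clip, which favors synthetic-audio detection at the cost of more false alarms on real audio. A majority rule would make the opposite trade-off.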