Detecting Audio Copy-Move Forgeries on Mel Spectrograms via Hybrid Keypoint Features
Abstract
1. Introduction
1. In this study, we introduce an approach that converts digital audio data into Mel spectrograms and processes them in the visual (image) domain. Unlike traditional methods that operate directly on raw audio signals, this technique applies visual processing techniques to audio forgery detection, an approach that remains relatively underexplored in the literature.
2. Instead of relying on a single keypoint detector and descriptor, the proposed method combines SIFT and FAST for keypoint detection, followed by SIFT and FREAK for descriptor generation. This hybrid feature extraction improves robustness against forgery artifacts.
3. The method maintains stable detection accuracy across different distortion types, including additive noise, lossy compression, and filtering.
4. The experimental results show that our method outperforms existing approaches across all evaluation metrics, highlighting its potential as a dependable tool for forgery detection.
2. Related Work
3. Dataset
3.1. Dataset 1 (Arabic Speech-Based Copy–Move Set)
3.2. Dataset 2 (Turkish Speech in Three Environments)
4. Proposed Method
4.1. Generation of Mel Spectrogram Images
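The waveform-to-Mel-spectrogram conversion in this subsection is usually done with `librosa.feature.melspectrogram`; the dependency-light sketch below rebuilds the pipeline with NumPy/SciPy so each step (STFT power, triangular mel filterbank, dB scaling) is explicit. The FFT size, hop length, and 128 mel bands are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=128, fmin=0.0, fmax=None):
    """Triangular mel filters mapping |STFT|^2 bins to mel bands."""
    fmax = fmax if fmax is not None else sr / 2.0
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):           # rising edge of the triangle
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling edge
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram_db(y, sr, n_fft=1024, hop=256, n_mels=128):
    """Log-scaled mel spectrogram: the dB image used for keypoint analysis."""
    _, _, Z = stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(Z) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels) @ power
    return 10.0 * np.log10(np.maximum(mel, 1e-10))
```

The dB image can then be normalized to 8-bit grayscale before the preprocessing and keypoint stages below.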
4.2. Preprocessing of Spectrogram Images
4.3. Keypoints Detection and Description
- Scale-space extrema detection: A Gaussian pyramid is constructed by convolving the input image I(x, y) with Gaussian filters at varying scales σ, as defined in Equation (3) [31]: L(x, y, σ) = G(x, y, σ) ∗ I(x, y), where L(x, y, σ) denotes the scale-space representation of the input image, ∗ is the convolution operator, and G(x, y, σ) is a Gaussian kernel defined in Equation (4) [31]: G(x, y, σ) = (1/(2πσ²)) exp(−(x² + y²)/(2σ²)), where (x, y) represents the spatial coordinates and σ is the scale parameter. Once the scale space is constructed, the Difference-of-Gaussian (DoG) images are obtained by subtracting adjacent Gaussian-blurred images within the same octave, as expressed in Equation (5) [31]: D(x, y, σ) = L(x, y, kσ) − L(x, y, σ). In this formulation, L(x, y, kσ) and L(x, y, σ) denote the Gaussian images at scales kσ and σ, respectively. Stable features are detected in the DoG images by comparing each pixel with its eight neighboring pixels at the same scale, as well as with nine neighbors in the scale above and nine in the scale below. A pixel is identified as a candidate keypoint if it constitutes a local extremum, either a maximum or a minimum. Unstable extrema are subsequently eliminated through a refinement procedure involving contrast thresholding and edge-response elimination.
- Orientation assignment: Each keypoint is assigned a dominant orientation based on local gradient directions, ensuring rotation invariance.
- Descriptor generation: A 128-dimensional feature vector is computed for each keypoint by partitioning its neighborhood into 4 × 4 subregions and accumulating an 8-bin gradient-orientation histogram in each subregion (4 × 4 × 8 = 128).
4.4. Matching and Filtering of Keypoints
- Partitioning the Descriptor Set: The descriptor set K is evenly split by index into two subsets, K1 and K2, to balance the computational load in subsequent matching.
- BF Matcher and Lowe’s Ratio Test: For each descriptor q in K1, the two best matches (m and n) within K2 are identified using OpenCV’s BFMatcher with the L2 norm (Euclidean distance). A match is considered valid only if it satisfies Lowe’s ratio test [31], d(q, m) < τ · d(q, n), which filters ambiguous matches by comparing the distance of the best candidate against that of the second-best, as shown in Equation (9) (see also [40]):
- Recursive Partitioning: The matching routine is applied recursively, re-partitioning each subset, until the current subset contains fewer than two descriptors. This divide-and-conquer strategy reduces computational complexity for large descriptor sets.
- Outlier Rejection with RANSAC: Finally, Random Sample Consensus (RANSAC) [41] is employed to eliminate residual false matches that may arise from repetitive patterns in the spectrogram. RANSAC iteratively estimates the dominant homography transformation between matched keypoints and evaluates their geometric consistency. Keypoint pairs whose reprojection error exceeds a predefined threshold (e.g., 3 pixels) are labeled as outliers and discarded, while consistent correspondences are retained as inliers. This filtering step ensures that only reliable matches are preserved, thereby improving the robustness of the tampering localization process.
5. Experimental Results
5.1. Post-Processing Attacks
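Two of the attacks evaluated in this section (additive Gaussian noise at a target SNR, median filtering) can be simulated directly on the waveform; the kernel size and random seed below are illustrative, and lossy 64 kbps compression is omitted since it requires an external codec.

```python
import numpy as np
from scipy.signal import medfilt

def add_noise_snr(y, snr_db, rng=None):
    """Additive white Gaussian noise scaled so the result has the target SNR (dB)."""
    if rng is None:
        rng = np.random.default_rng(0)
    p_signal = np.mean(y ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return y + rng.normal(0.0, np.sqrt(p_noise), y.shape)

def median_filter_attack(y, kernel=3):
    """Median filtering, a common smoothing attack on speech samples."""
    return medfilt(y, kernel_size=kernel)
```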
5.2. Evaluation Metrics
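Given binary segment-level decisions (1 = forged, 0 = authentic), the four reported metrics follow directly from confusion-matrix counts; scikit-learn [86] provides the same values via `sklearn.metrics.precision_recall_fscore_support`. A minimal reference implementation:

```python
def forgery_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels
    (1 = forged, 0 = authentic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```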
5.3. Results and Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zakariah, M.; Khan, M.K.; Malik, H. Digital multimedia audio forensics: Past, present and future. Multimed. Tools Appl. 2018, 77, 1009–1040. [Google Scholar] [CrossRef]
- Ustubioglu, B.; Tahaoglu, G.; Ulutas, G. Detection of audio copy-move-forgery with novel feature matching on Mel spectrogram. Expert Syst. Appl. 2023, 213, 118963. [Google Scholar] [CrossRef]
- Güç, H.K.; Üstübioğlu, B.; Üstübioğlu, A.; Ulutaş, G. Audio Forgery Detection Method with Mel Spectrogram. In Proceedings of the 2023 16th International Conference on Information Security and Cryptology (ISCTürkiye), Ankara, Turkiye, 18–19 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
- Ulutas, G.; Tahaoglu, G.; Ustubioglu, B. Forge audio detection using keypoint features on mel spectrograms. In Proceedings of the 2022 45th International Conference on Telecommunications and Signal Processing (TSP), Virtual, Online, Czech Republic, 13–15 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 413–416. [Google Scholar]
- Ustubioglu, B.; Tahaoglu, G.; Ulutas, G.; Ustubioglu, A.; Kilic, M. Audio forgery detection and localization with super-resolution spectrogram and keypoint-based clustering approach. J. Supercomput. 2024, 80, 486–518. [Google Scholar] [CrossRef]
- Imran, M.; Ali, Z.; Bakhsh, S.T.; Akram, S. Blind detection of copy-move forgery in digital audio forensics. IEEE Access 2017, 5, 12843–12855. [Google Scholar] [CrossRef]
- Rosten, E.; Drummond, T. Machine Learning for High-Speed Corner Detection. In Computer Vision—ECCV 2006, Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 430–443. [Google Scholar]
- Yan, Q.; Yang, R.; Huang, J. Robust copy–move detection of speech recording using similarities of pitch and formant. IEEE Trans. Inf. Forensics Secur. 2019, 14, 2331–2341. [Google Scholar] [CrossRef]
- Wang, F.; Li, C.; Tian, L. An algorithm of detecting audio copy-move forgery based on DCT and SVD. In Proceedings of the 2017 IEEE 17th International Conference on Communication Technology (ICCT), Chengdu, China, 27–30 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1652–1657. [Google Scholar]
- Xie, Z.; Lu, W.; Liu, X.; Xue, Y.; Yeung, Y. Copy-move detection of digital audio based on multi-feature decision. J. Inf. Secur. Appl. 2018, 43, 37–46. [Google Scholar] [CrossRef]
- Akdeniz, F.; Becerikli, Y. Recurrent neural network and long short-term memory models for audio copy-move forgery detection: A comprehensive study. J. Supercomput. 2024, 80, 17575–17605. [Google Scholar] [CrossRef]
- Akdeniz, F.; Becerikli, Y. Detecting audio copy-move forgery with an artificial neural network. Signal Image Video Process. 2024, 18, 2117–2133. [Google Scholar] [CrossRef]
- Wang, D.; Li, X.; Shi, C.; Niu, X.; Xiong, L.; Wu, H.; Qian, Q.; Qi, C. Robust copy-move detection and localization of digital audio based CFCC feature. Multimed. Tools Appl. 2024, 84, 9573–9589. [Google Scholar] [CrossRef]
- Xiao, J.-n.; Jia, Y.-z.; Fu, E.-d.; Huang, Z.; Li, Y.; Shi, S.-p. Audio authenticity: Duplicated audio segment detection in waveform audio file. J. Shanghai Jiaotong Univ. (Sci.) 2014, 19, 392–397. [Google Scholar] [CrossRef]
- Su, Z.; Li, M.; Zhang, G.; Wu, Q.; Li, M.; Zhang, W.; Yao, X. Robust audio copy-move forgery detection using constant Q spectral Sketches and GA-SVM. IEEE Trans. Dependable Secur. Comput. 2022, 20, 4016–4031. [Google Scholar] [CrossRef]
- Su, Z.; Li, M.; Zhang, G.; Wu, Q.; Wang, Y. Robust audio copy-move forgery detection on short forged slices using sliding window. J. Inf. Secur. Appl. 2023, 75, 103507. [Google Scholar] [CrossRef]
- Ustubioglu, A.; Ustubioglu, B.; Ulutas, G. Mel spectrogram-based audio forgery detection using CNN. Signal Image Video Process. 2023, 17, 2211–2219. [Google Scholar] [CrossRef]
- Dincer, S.; Ustubioglu, B.; Ulutas, G.; Tahaoglu, G.; Ustubioglu, A. Robust Audio Forgery Detection Method Based on Capsule Network. In Proceedings of the 2023 International Conference on Electrical and Information Technology (IEIT), Malang, Indonesia, 14–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 243–247. [Google Scholar]
- Yazici, S.; Üstübioğlu, B.; Kiliç, M.; Ulutaş, G. Block-Based Forgery Detection with Binary Gradient Model. In Proceedings of the 2022 15th International Conference on Information Security and Cryptography (ISCTURKEY), Ankara, Turkey, 19–20 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 38–43. [Google Scholar]
- Üstübioğlu, B.; Tahaoglu, G. Yüksek çözünürlüklü spektrogram görüntülerinden AKAZE yöntemi ile ses sahteciliği tespiti [Audio forgery detection from high-resolution spectrogram images with the AKAZE method]. Kahramanmaraş Sütçü İmam Üniv. Mühendis. Bilim. Derg. 2023, 26, 961–972. [Google Scholar] [CrossRef]
- Ustubioglu, B.; Küçükuğurlu, B.; Ulutas, G. Robust copy-move detection in digital audio forensics based on pitch and modified discrete cosine transform. Multimed. Tools Appl. 2022, 81, 27149–27185. [Google Scholar] [CrossRef]
- Huang, X.; Liu, Z.; Lu, W.; Liu, H.; Xiang, S. Fast and effective copy-move detection of digital audio based on auto segment. In Digital Forensics and Forensic Investigations: Breakthroughs in Research and Practice; IGI Global: Hershey, PA, USA, 2020; pp. 127–142. [Google Scholar]
- Üstübioğlu, B.; Üstübioğlu, A. Görsel kelime tabanlı ses sahteciliği tespit yöntemi [A visual-word-based audio forgery detection method]. Niğde Ömer Halisdemir Üniv. Mühendis. Bilim. Derg. 2024, 13, 350–358. [Google Scholar] [CrossRef]
- Ustubioglu, B. An Attack-Independent Audio Forgery Detection Technique Based on Cochleagram Images of Segments With Dynamic Threshold. IEEE Access 2024, 12, 82660–82675. [Google Scholar] [CrossRef]
- Arabic Speech Corpus. Available online: https://en.arabicspeechcorpus.com/ (accessed on 1 November 2025).
- Ustubioglu, B.; Tahaoglu, G.; Ayaz, G.O.; Ustubioglu, A.; Ulutas, G.; Cosar, M.; Kılıc, E.; Kılıc, M. KTUCengAudioForgerySet: A new audio copy-move forgery dataset. In Proceedings of the 2024 47th International Conference on Telecommunications and Signal Processing (TSP), Prague, Czech Republic, 10–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 123–129. [Google Scholar]
- Prakash, C.S.; Panzade, P.P.; Om, H.; Maheshkar, S. Detection of copy-move forgery using AKAZE and SIFT keypoint extraction. Multimed. Tools Appl. 2019, 78, 23535–23558. [Google Scholar] [CrossRef]
- Hosny, K.M.; Hamza, H.M.; Lashin, N.A. Copy-move forgery detection of duplicated objects using accurate PCET moments and morphological operators. Imaging Sci. J. 2018, 66, 330–345. [Google Scholar] [CrossRef]
- Panzade, P.P.; Prakash, C.S.; Maheshkar, S. Copy-move forgery detection by using HSV preprocessing and keypoint extraction. In Proceedings of the 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016; pp. 264–269. [Google Scholar]
- Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
- Harris, C.G.; Stephens, M.J. A Combined Corner and Edge Detector. In Proceedings of the Alvey Vision Conference, Manchester, UK, 31 August–2 September 1988. [Google Scholar]
- OpenCV Developers. OpenCV SIFT Feature Detector and Descriptor. 2025. Available online: https://docs.opencv.org/4.x/da/df5/tutorial_py_sift_intro.html (accessed on 22 October 2025).
- Bresenham, J. A linear algorithm for incremental digital display of circular arcs. Commun. ACM 1977, 20, 100–106. [Google Scholar] [CrossRef]
- Huang, J.; Zhou, G.; Zhou, X.; Zhang, R. A New FPGA Architecture of FAST and BRIEF Algorithm for On-Board Corner Detection and Matching. Sensors 2018, 18, 1014. [Google Scholar] [CrossRef]
- Alahi, A.; Ortiz, R.; Vandergheynst, P. Freak: Fast retina keypoint. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 510–517. [Google Scholar]
- Huang, H.; Guo, W.; Zhang, Y. Detection of copy-move forgery in digital images using SIFT algorithm. In Proceedings of the 2008 IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application, Wuhan, China, 19–20 December 2008; Volume 2, pp. 272–276. [Google Scholar]
- OpenCV Developers. OpenCV Brute-Force Feature Matching. 2025. Available online: https://docs.opencv.org/4.x/dc/dc3/tutorial_py_matcher.html (accessed on 22 October 2025).
- Hammoud, M.; Getahun, M.; Lupin, S. Comparison of Outlier Filtering Methods in Terms of Their Influence on Pose Estimation Quality. Int. J. Open Inf. Technol. 2023, 11, 1–5. [Google Scholar]
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Çoban, Ö. An assessment of nature-inspired algorithms for text feature selection. Comput. Sci. 2022, 23, 179–204. [Google Scholar] [CrossRef]
- Çoban, Ö.; Yücel Altay, Ş. Arming text-based gender inference with partition membership filtering and feature selection for online social network users. Comput. J. 2025, 68, 1208–1224. [Google Scholar] [CrossRef]
- Broussard, M.; Diakopoulos, N.; Guzman, A.L.; Abebe, R.; Dupagne, M.; Chuan, C.H. Artificial intelligence and journalism. J. Mass Commun. Q. 2019, 96, 673–695. [Google Scholar] [CrossRef]
- Chesney, B.; Citron, D. Deep fakes: A looming challenge for privacy, democracy, and national security. Calif. Law Rev. 2019, 107, 1753. [Google Scholar] [CrossRef]
- Vaccari, C.; Chadwick, A. Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news. Soc. Media Soc. 2020, 6, 2056305120903408. [Google Scholar] [CrossRef]





| Study | Analysis Approach | Detection Approach | Similarity Method | Dataset |
|---|---|---|---|---|
| [14] | Window-based | - | Fast Convolution | Authors’ dataset |
| [16] | Window-based | CQCC | PCC | Librispeech Chinspeech |
| [8] | VAD-based | Pitch and formant sequences | DTW | TIMIT Wall Street Journal |
| [6] | VAD-based | LBP | MSE, energy ratio | KSU Arabic Speech |
| [9] | VAD-based | DCT-SVD | ED | Authors’ dataset |
| [10] | VAD-based | C4.5 DT integrating four models (MFCC, gammatone, pitch, DFT) | PCC and AD | Authors’ dataset |
| [11] | VAD-based | MFCC, RNN, and LSTM | Model-based classification | TIMIT |
| [12] | VAD-based | MFCC, ΔMFCC, ΔΔMFCC, LPC, and ANN | Model-based classification | TIMIT |
| [21] | VAD-based | Pitch and Modified DCT | ED | TIMIT |
| [13] | VAD-based | Pitch and CFCC | PCC and DTW | TIMIT |
| [22] | VAD-based | DFT | PCC | Authors’ dataset |
| [23] | VAD and Spectrogram-based | DSSIM | DSSIM | TIMIT |
| [24] | VAD and Spectrogram-based | Pitch and SSIM | SSIM | TIMIT Arabic Speech Corpus |
| [2] | Spectrogram-based | SIFT | ED | TIMIT Arabic Speech Corpus |
| [3] | Spectrogram-based | SIFT | Moment invariants | TIMIT |
| [4] | Spectrogram-based | SURF | g2NN | TIMIT |
| [5] | Spectrogram-based | BRIEF | OPTICS | TIMIT Arabic Speech Corpus |
| [17] | Spectrogram-based | CNN | Model-based classification | TIMIT Arabic Speech Corpus |
| [18] | Spectrogram-based | Capsule Network | Model-based classification | Arabic Speech Corpus |
| [19] | Spectrogram-based | BGP | ED | TIMIT |
| [20] | Spectrogram-based | AKAZE | g2NN | TIMIT |
| | ARANORM0954_2-3.wav | ARANORM0033_1-3.wav | ARANORM0701_2-3.wav |
|---|---|---|---|
| NA | Keypoints: 4080; matches after BF-Matcher: 209; after RANSAC: 172 | Keypoints: 3217; matches after BF-Matcher: 761; after RANSAC: 740 | Keypoints: 3467; matches after BF-Matcher: 557; after RANSAC: 530 |
| COM | Keypoints: 4216; matches after BF-Matcher: 91; after RANSAC: 71 | Keypoints: 3483; matches after BF-Matcher: 175; after RANSAC: 149 | Keypoints: 3603; matches after BF-Matcher: 139; after RANSAC: 115 |
| MDF | Keypoints: 3865; matches after BF-Matcher: 199; after RANSAC: 150 | Keypoints: 3670; matches after BF-Matcher: 737; after RANSAC: 711 | Keypoints: 3598; matches after BF-Matcher: 589; after RANSAC: 548 |
| GN20 | Keypoints: 2632; matches after BF-Matcher: 59; after RANSAC: 44 | Keypoints: 2488; matches after BF-Matcher: 82; after RANSAC: 74 | Keypoints: 2711; matches after BF-Matcher: 132; after RANSAC: 118 |
| GN30 | Keypoints: 3525; matches after BF-Matcher: 114; after RANSAC: 85 | Keypoints: 2547; matches after BF-Matcher: 222; after RANSAC: 210 | Keypoints: 2988; matches after BF-Matcher: 299; after RANSAC: 284 |
| | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 (Arabic) | 0.9505 | 0.9968 | 0.8839 | 0.9370 |
| Dataset 2 (Turkish) | 0.9192 | 0.9247 | 0.9503 | 0.9373 |
| Attack | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| 64 kbps compression | 0.8520 | 0.9957 | 0.6476 | 0.7847 |
| Median filtering | 0.9499 | 0.9968 | 0.8825 | 0.9362 |
| Gaussian noise (SNR = 20 dB) | 0.8427 | 0.9955 | 0.6252 | 0.7680 |
| Gaussian noise (SNR = 30 dB) | 0.9003 | 0.9964 | 0.7636 | 0.8646 |
| Attack | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| 64 kbps compression | 0.8834 | 0.9203 | 0.8939 | 0.9069 |
| Median filtering | 0.9162 | 0.9243 | 0.9455 | 0.9348 |
| Gaussian noise (SNR = 20 dB) | 0.7801 | 0.9043 | 0.7314 | 0.8087 |
| Gaussian noise (SNR = 30 dB) | 0.8663 | 0.9180 | 0.8671 | 0.8918 |
| Attacks | Formant [8] | LBP [6] | DCT-SVD [9] | DFT [22] | This Study |
|---|---|---|---|---|---|
| 64 kbps compression | 0.57 | 0.30 | 0.30 | 0.20 | 0.8520 |
| Median filtering | 0.57 | 0.10 | 0.40 | 0.20 | 0.9499 |
| Gaussian noise (SNR = 20 dB) | 0.63 | 0.50 | 0.60 | 0.20 | 0.8427 |
| Gaussian noise (SNR = 30 dB) | 0.60 | 0.40 | 0.40 | 0.16 | 0.9003 |
| Attacks | Formant [8] | LBP [6] | DCT-SVD [9] | DFT [22] | This Study |
|---|---|---|---|---|---|
| 64 kbps compression | 0.1859 | 0.3682 | 0.4617 | 0.4957 | 0.8834 |
| Median filtering | 0.2369 | 0.5067 | 0.5237 | 0.5328 | 0.9162 |
| Gaussian noise (SNR = 20 dB) | 0.2096 | 0.3779 | 0.4970 | 0.5243 | 0.7801 |
| Gaussian noise (SNR = 30 dB) | 0.2309 | 0.3779 | 0.4927 | 0.5322 | 0.8663 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ozgen, E.; Altay, S.Y. Detecting Audio Copy-Move Forgeries on Mel Spectrograms via Hybrid Keypoint Features. Appl. Sci. 2025, 15, 11845. https://doi.org/10.3390/app152111845