Hierarchical Deep Learning for File Fragment Classification
Abstract
1. Introduction
- A hierarchical classification structure with a dynamic adjustment mechanism is designed based on the relationships between file types.
- A dedicated classifier is built for each leaf node: the classification model is chosen according to that node's data characteristics and trained specifically on the classes it contains.
- Extensive experiments on the FFT-75 dataset show that the method achieves state-of-the-art results.
2. Related Work
3. Methodology
3.1. Hierarchical Clustering Strategy
3.2. Leaf Node Optimization
Algorithm 1. Hierarchical clustering with dynamic adjustment.
3.3. Neural Network Classifiers
3.4. Architecture Variants
- P1 (deep architecture): This is used for complex leaf nodes requiring fine-grained discrimination (e.g., leaf nodes containing diverse multimedia formats). This variant comprises 4 convolutional layers with filter sizes , the first followed by max-pooling, and the second directly connected to global average pooling. The deeper architecture captures more complex patterns at the cost of increased computational requirements. The dropout rate is set to 0.2. The architecture is depicted in Figure 2.
- P2 (shallow architecture): This is used for simpler leaf nodes. This variant uses 2 convolutional layers with 32 filters each. The first convolutional layer is followed by max-pooling, while the second connects directly to the global average pooling layer, which reduces computational cost for simpler classification tasks. The dropout rate is set to 0.1. The architecture is depicted in Figure 3.
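To make the P2 layer order concrete, the following minimal sketch traces output shapes and trainable-parameter counts stage by stage. Only the two 32-filter convolutions, the pooling order, and the dropout rate of 0.1 come from the description above. The byte-embedding input stage, `embed_dim`, `kernel`, `pool`, "same" padding, the 4096-byte fragment length, and the final dense layer are illustrative assumptions, not details stated in the text.

```python
def trace_p2(seq_len, n_classes, embed_dim=16, kernel=3, pool=2):
    """List (stage, output_shape, trainable_params) for the P2 layer order.

    embed_dim, kernel and pool are hypothetical values chosen for illustration.
    """
    filters = 32   # from the text: 32 filters in each convolutional layer
    dropout = 0.1  # from the text

    rows = []
    # Assumed input stage: each of the 256 byte values gets an embedding vector.
    rows.append(("embedding", (seq_len, embed_dim), 256 * embed_dim))
    # Conv 1 ('same' padding assumed, so the sequence length is unchanged).
    rows.append(("conv1", (seq_len, filters),
                 kernel * embed_dim * filters + filters))
    # Max-pooling follows the first convolution only.
    pooled = seq_len // pool
    rows.append(("max_pool", (pooled, filters), 0))
    # Conv 2 feeds global average pooling directly, with no pooling in between.
    rows.append(("conv2", (pooled, filters),
                 kernel * filters * filters + filters))
    rows.append(("global_avg_pool", (filters,), 0))
    rows.append((f"dropout({dropout})", (filters,), 0))
    # Assumed output stage: one dense layer over the classes of the leaf node.
    rows.append(("dense_softmax", (n_classes,), filters * n_classes + n_classes))
    return rows


# Example: a 4096-byte fragment routed to a leaf node holding 13 classes.
for stage, shape, params in trace_p2(seq_len=4096, n_classes=13):
    print(f"{stage:<16} {str(shape):<12} {params}")
```

Because global average pooling collapses the sequence axis, the size of the classification head depends only on the filter count and the number of classes in the leaf, not on the fragment length.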
4. Experimental Evaluation
4.1. Experimental Setup
4.1.1. Dataset
4.1.2. Implementation Details
4.2. Experimental Results
- Level_3_0_0_0 (ARW, RW2, 3FR, DLL, WMA, PCAP, DWG): Groups raw image files with dynamic link libraries, audio files, and network packet files, possibly due to consistency in binary structure or data header features.
- Level_3_0_0_1 (EPS, MACH-O, ELF, DOC, MD, RTF, TXT, TEX, JSON, HTML, XML, LOG, CSV, SQLITE): Accurately groups text files, executable files, and code files, demonstrating shared storage structures and text markup features.
- Level_3_0_1_0 (NRW, RAF, XLS, TTF): Groups raw image formats with spreadsheet and font files, possibly due to similarities in file metadata or format markers.
- Level_3_1_0_0 (CR2, ORF, MOV, 3GP, WEBM, JAR, MOBI, PDF, AIFF, FLAC, M4A, WAV, BMP, KEY, PPTX): Successfully clusters multimedia files sharing specific encoding and packaging formats.
- Level_3_1_0_1 (JPG, DNG, TIFF, HEIC, PNG, MP4, AVI, MKV, OGV, APK, MSI, DMG, 7Z, MP3): Includes mainstream multimedia files, installation packages, and compressed files.
- Level_3_1_0_2 (BZ2, DEB, GZ, PKG, RAR, RPM, XZ, ZIP, EXE, DOCX, XLSX, DJVU, EPUB): Groups compressed files and advanced document formats.
- Level_3_1_0_3 (NEF, GIF, AI, PSD, PPT, OGG): Groups design files, image files, and audio files.
- Merged_Leaf (GPR, PEF): Groups raw image formats with unique storage characteristics.
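The leaf-node listing above can be written down as a lookup structure. The sketch below, in which the function and variable names are hypothetical, builds a routing table from file type to leaf node and checks that no type appears in two leaves. It only encodes the membership listed above and does not reproduce the clustering procedure that produced it.

```python
# Leaf names and members are copied from the listing above.
LEAVES = {
    "Level_3_0_0_0": "ARW RW2 3FR DLL WMA PCAP DWG",
    "Level_3_0_0_1": "EPS MACH-O ELF DOC MD RTF TXT TEX JSON HTML XML LOG CSV SQLITE",
    "Level_3_0_1_0": "NRW RAF XLS TTF",
    "Level_3_1_0_0": "CR2 ORF MOV 3GP WEBM JAR MOBI PDF AIFF FLAC M4A WAV BMP KEY PPTX",
    "Level_3_1_0_1": "JPG DNG TIFF HEIC PNG MP4 AVI MKV OGV APK MSI DMG 7Z MP3",
    "Level_3_1_0_2": "BZ2 DEB GZ PKG RAR RPM XZ ZIP EXE DOCX XLSX DJVU EPUB",
    "Level_3_1_0_3": "NEF GIF AI PSD PPT OGG",
    "Merged_Leaf": "GPR PEF",
}
LEAVES = {name: members.split() for name, members in LEAVES.items()}


def build_routing_table(leaves):
    """Map each file type to its leaf node; reject types listed in two leaves."""
    table = {}
    for leaf, members in leaves.items():
        for file_type in members:
            if file_type in table:
                raise ValueError(f"{file_type} is in {table[file_type]} and {leaf}")
            table[file_type] = leaf
    return table


ROUTING = build_routing_table(LEAVES)


def candidate_classes(file_type):
    """Return the classes that the leaf classifier for `file_type` must separate."""
    return LEAVES[ROUTING[file_type.upper()]]


print(len(ROUTING), "file types in", len(LEAVES), "leaf nodes")
print("xz shares a leaf with:", candidate_classes("xz"))
```

Under this listing the eight leaves partition all 75 FFT-75 classes, with leaf sizes ranging from 2 (Merged_Leaf) to 15 (Level_3_1_0_0).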
- HEIC (20%): As a high-efficiency image container, its high entropy characteristics (from advanced compression) and fragmented data distribution lead to misclassification, mainly as 7z (27.7%), followed by mp4 (19.1%) and avi (11.3%).
- MP4 (21%): Combines compound structure (embedded video/audio/metadata) with high entropy (e.g., H.265 encoding), causing misclassification, mainly as 7z (28.4%), followed by avi (17.2%) and dmg (3.8%).
- DOCX (21%): Compound document format with embedded multimedia elements (images, formatting tags), resulting in cross-type misclassification, mainly as epub (16.2%), followed by bz2 (10.9%) and gz (10.6%).
- 7Z (21%)/BZ2 (19%)/XZ (16%): These formats exhibit extremely high entropy (deep compression eliminates redundancy), resulting in severe misclassification. Specifically, XZ is mainly misclassified as bz2 (23.1%), followed by djvu (10.0%) and docx (8.5%), while BZ2 is mainly misclassified as xz (24.9%), followed by docx (9.1%) and deb (5.7%).
- EXE (18%): Windows executables’ complex structure (code/resources/strings) and high entropy drive misclassification, mainly as xz (27.2%), followed by bz2 (18.4%) and djvu (9.2%).
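The error analysis above attributes most confusions to high entropy. As an illustration, the sketch below computes the Shannon entropy of a fragment's byte-value distribution, in bits per byte. Treating "entropy" as byte-level Shannon entropy is an assumption here, since the text does not state how entropy was measured, and `byte_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value histogram, in bits per byte (0 to 8)."""
    if not fragment:
        return 0.0
    n = len(fragment)
    counts = Counter(fragment)  # iterating over bytes yields ints in 0..255
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# A fragment that uses every byte value equally often reaches the maximum.
uniform = bytes(range(256)) * 2          # 512 bytes
# Repetitive ASCII text uses few distinct byte values and scores far lower.
text = b"the quick brown fox " * 25      # 500 bytes

print(byte_entropy(uniform))  # 8.0
print(byte_entropy(text))
```

Well-compressed or encoded payloads tend to sit close to the 8 bits/byte ceiling regardless of container format, which is consistent with the confusions among 7z, xz, bz2, and compressed media reported above.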
5. Conclusions
1. The dynamic adjustment mechanism has limited adaptability, currently only meeting the needs of conventional classification scenarios, and its performance in complex and dynamic classification scenarios needs improvement.
2. There is room for optimization in the targeted nature and architectural efficiency of leaf node classifiers. Existing models do not fully adapt to the unique characteristics of data from various categories, and the exploration of efficient training strategies is insufficient.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest



| No. | Classes | Acc. (%) | No. | Classes | Acc. (%) | No. | Classes | Acc. (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | arw | 100 | 26 | nef | 98 | 51 | pptx | 69 |
| 2 | gpr | 100 | 27 | mkv | 98 | 52 | apk | 68 |
| 3 | nrw | 100 | 28 | doc | 98 | 53 | key | 68 |
| 4 | pef | 100 | 29 | m4a | 98 | 54 | pdf | 65 |
| 5 | raf | 100 | 30 | mobi | 97 | 55 | jar | 63 |
| 6 | 3fr | 100 | 31 | txt | 97 | 56 | djvu | 58 |
| 7 | eps | 100 | 32 | jpg | 96 | 57 | flac | 58 |
| 8 | xls | 100 | 33 | tiff | 96 | 58 | rpm | 48 |
| 9 | json | 100 | 34 | psd | 96 | 59 | avi | 46 |
| 10 | xml | 100 | 35 | mach-o | 96 | 60 | dmg | 44 |
| 11 | log | 100 | 36 | ogg | 96 | 61 | zip | 40 |
| 12 | csv | 100 | 37 | elf | 94 | 62 | gz | 36 |
| 13 | aiff | 100 | 38 | pcap | 94 | 63 | mov | 34 |
| 14 | wav | 100 | 39 | xlsx | 93 | 64 | msi | 34 |
| 15 | wma | 100 | 40 | orf | 92 | 65 | epub | 29 |
| 16 | ttf | 100 | 41 | cr2 | 91 | 66 | deb | 26 |
| 17 | dwg | 100 | 42 | gif | 90 | 67 | pkg | 25 |
| 18 | rw2 | 99 | 43 | 3gp | 90 | 68 | rar | 23 |
| 19 | dll | 99 | 44 | bmp | 85 | 69 | mp4 | 21 |
| 20 | md | 99 | 45 | ppt | 84 | 70 | 7z | 21 |
| 21 | rtf | 99 | 46 | ogv | 82 | 71 | docx | 21 |
| 22 | tex | 99 | 47 | dng | 81 | 72 | heic | 20 |
| 23 | html | 99 | 48 | ai | 78 | 73 | bz2 | 19 |
| 24 | mp3 | 99 | 49 | webm | 71 | 74 | exe | 18 |
| 25 | sqlite | 99 | 50 | png | 70 | 75 | xz | 16 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zou, B.; Liu, H. Hierarchical Deep Learning for File Fragment Classification. Electronics 2026, 15, 1507. https://doi.org/10.3390/electronics15071507
