Hierarchical Deep Learning for File Fragment Classification

Zou, Bailin; Liu, Huiyi

doi:10.3390/electronics15071507

by Bailin Zou ^1,2,* and Huiyi Liu

¹ College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China

² Institute of Information Technology, Nanjing Police University, Nanjing 210042, China

^* Author to whom correspondence should be addressed.

Electronics 2026, 15(7), 1507; https://doi.org/10.3390/electronics15071507

Submission received: 22 February 2026 / Revised: 11 March 2026 / Accepted: 30 March 2026 / Published: 3 April 2026

(This article belongs to the Special Issue Digital Security and Privacy Protection: Trends and Applications, 3rd Edition)

Abstract

File fragment classification is crucial in digital forensics, aiding in the recovery and reconstruction of fragmented files, which serve as key evidence. While deep learning techniques have advanced this area, challenges remain, particularly in accounting for inter-file-type relationships and the granularity of classification. To overcome these challenges, we introduce a hierarchical classification approach that leverages an agglomerative hierarchical clustering algorithm combined with a dynamic adjustment mechanism, optimizing category distribution among leaf nodes. This structure is further enhanced by developing specific classifiers for each leaf node, tailored to its unique characteristics. Experimental results on the FFT-75 dataset show that our method achieves 76.3% accuracy in a 75-class scenario (512-byte blocks), surpassing the accuracy of existing approaches. The method thereby mitigates the misclassification caused by an excessive number of classification types.

Keywords: file fragment type identification; hierarchical classification structure; deep learning; digital forensics; file carving

1. Introduction

Digital forensics is a method used to investigate and analyze digital devices or data, playing a crucial role in fields such as criminal investigations, legal proceedings, and cyber security. It aids investigators in collecting, analyzing, and preserving digital evidence, thereby uncovering criminal activities, establishing case facts, and safeguarding data security and privacy.

In digital forensics, file fragment classification is critical. With the diversification of file storage and transmission methods, files are often divided into multiple fragments and stored in different locations, posing challenges for digital forensics. Particularly noteworthy is that files are frequently damaged or deliberately deleted by human intervention, leading to partial overwriting of file content. Consequently, investigative clues and evidence that forensic examiners attempt to acquire from storage media (including raw flash media) or memory images often do not exist as intact files but reside as fragments on the media. File fragment classification analyzes the content and characteristics of file fragments, helping to restore and reconstruct divided files, thereby recovering complete digital evidence. It can identify different types of file fragments, including text, images, and videos, providing valuable information and clues for digital forensics. By leveraging file fragment classification, investigators can analyze case details more accurately, locate key evidence, expedite case resolution, and improve the efficiency and accuracy of forensic investigations.

Digital forensics, a critical tool for investigating cyber crimes and network attacks, faces the challenge of recovering and identifying file types when metadata is absent. Traditional recovery methods based on file system metadata become ineffective in such scenarios, prompting forensic analysts to rely on file carving techniques to reconstruct files based on their content rather than metadata. However, classifying the file fragments generated during the file carving process has emerged as a research hot spot. Existing methods can be categorized into two types: magic-number-based and content-based approaches. Magic-number-based methods have limitations when dealing with files lacking distinct magic numbers or highly fragmented files, making content-based methods the focus of increasing attention.

Traditional file fragment classification methods, such as those relying on file extensions and magic numbers, perform poorly due to loss of metadata or file fragmentation. As a result, researchers have turned to machine learning techniques. They utilize feature extraction methods like n-grams [1,2], Shannon entropy [3,4,5], or Kolmogorov complexity [3], combined with machine learning algorithms such as support vector machines [1,4,5,6,7], decision trees [8,9], sparse coding [2], and neural networks [8,10,11,12,13], to classify the type of a given file fragment. However, these methods, although performing well in certain cases, are still affected by file fragmentation, especially in severely fragmented scenarios. Additionally, methods relying on manually extracted features have several drawbacks. They are time-consuming, labor-intensive, highly subjective, limited in generalization ability, and lack interpretability. These issues constrain model performance.

In recent years, deep learning has made significant progress in file fragment classification. By utilizing deep learning models such as convolutional neural networks (CNNs), tools like Gray-scale [14] and FiFTy [15] have demonstrated effectiveness in file fragment classification. However, despite CNNs being one of the most commonly used deep learning models for classification tasks, existing solutions still have room for improvement in terms of performance and accuracy. For instance, they fail to consider the relationships between different file types and overlook the granularity of classification.

Therefore, to avoid explicit feature computation, we explore deep learning methods. Unlike previous works, we explicitly consider file type relationships and design a hierarchical classification structure based on agglomerative hierarchical clustering. Additionally, we introduce a dynamic adjustment mechanism to optimize category distribution among leaf nodes, enhancing the rationality and stability of the hierarchical structure. Building on this foundation, we develop a dedicated classifier for each leaf node and select an appropriate classifier architecture for training based on leaf node characteristics. This design ensures that each classifier is optimized for the categories of its leaf node, improving the accuracy and efficiency of file fragment classification. It alleviates misclassification problems caused by an excessive number of classification types, and the hierarchical structure is simple and easy to extend. Our contributions are as follows:

A hierarchical classification structure with a dynamic adjustment mechanism is designed based on the relationships between file types.
Dedicated classifiers are built for each leaf node, fully considering the unique data characteristics of each node, and selecting appropriate classification models for targeted training.
Extensive experimental validation on the FFT-75 dataset demonstrates that our method exhibits outstanding performance, achieving state-of-the-art results.

The organization of the paper is as follows:

In Section 2, we provide a systematic review and overview of related work in file fragment classification. Section 3 details our proposed hierarchical clustering algorithm and the design of dedicated classifiers. In Section 4, we report the results of the experimental evaluation, validating the effectiveness and superiority of the proposed model through data analysis and comparisons. Finally, in Section 5, we summarize the main findings and contributions of the study and highlight potential directions for future research. The code implementation of this framework is publicly available on GitHub (version 1.0.0) for reproducibility and further research, and can be accessed at: https://github.com/zbl-test/Hierarchical_FFC/releases/tag/V1.0.0 (accessed on 29 March 2026).

2. Related Work

The algorithm proposed by Karresand et al. [16] classifies data fragments by measuring the rate of change in the byte content of digital media and extends the Oscar method based on byte frequency distribution presented in previous papers. Evaluation of the new method showed a detection rate of 99.2% when scanning JPEG data, with no false positives. Even the slowest implementation could scan a 72.2 MB file in approximately 2.5 s and demonstrated linear scalability. Calhoun et al. [17] proposed two algorithms, one based on Fisher Linear Discriminant Analysis and the other on the Longest Common Subsequence, to improve the accuracy and efficiency of file type prediction. Experimental results show these methods have potential in predicting file types, especially when handling file fragments without header information. Masoumi et al. [18] divided files into different fragments and then extracted features of file fragments using Byte Frequency Distribution (BFD). They performed dimensionality reduction using the SFS and SFFS algorithms and utilized MLP, SVM, and KNN classifiers for type identification. Results showed this method could effectively identify six common file types. Bhatt et al. [19] proposed a hierarchical machine-learning-based file fragment classification method, using optimized support vector machines (SVMs) as the base classifier and defining a raw classification system for file types. Evaluation on a dataset containing 14 file types showed the method achieved an average accuracy of 67.78% and F1-measure of 65% under 10-fold cross-validation. Hanis et al. [9] used machine learning methods to classify five common text file formats: PDF, DOC, DOCX, RTF, and TXT. They also examined the impact of contextual language on file fragment classification. When training and testing in the same language environment, the classification accuracies for Persian, English, and Chinese reached 85.6%, 76.4%, and 86.1%, respectively. In different language environments, the average accuracies dropped to 60.0%, 58.5%, and 71.4%. Karampidis et al. [13] proposed a file type recognition method based on computational intelligence techniques, achieving accurate detection of common image file types (JPG, PNG, GIF) and uncompressed TIFF images through a three-stage process involving feature extraction, genetic algorithm-based feature selection, and neural network classification. Experimental results showed this method had very high recognition accuracy for tampered image file types. Wang et al. [2] proposed an automated feature extraction method based on sparse coding, which, by learning sparse dictionaries of n-grams of different sizes, could estimate the n-gram frequency of a given file fragment without being affected by combinatorial explosion. The statistical and machine-learning-based methods above generally have several issues, such as the need for manual feature extraction, reliance on experience, time consumption, and limited focus on file types.

Recent deep learning advances have driven file fragment classification toward end-to-end feature learning and performance breakthroughs. Wang et al. [20] proposed JSANet, a joint self-attention network integrating byte, channel and sector self-attention modules, which fuses intra-sector local features and inter-sector contextual information and improves accuracy by over 16.3% on the self-built variable-length fragment dataset VFF-16. Aiming at the class imbalance problem in forensic datasets, Alam et al. [21] designed a hybrid resampling framework combining SMOTE oversampling and ENN/Tomek undersampling, cooperating with TF-IDF feature selection and random forest classifier, which increased the weighted average true positive rate to 81.6%. Different from traditional CNNs, Park et al. [22] constructed XMP, the first Transformer encoder-based model for file fragment classification, which adopted multi-scale self-attention and cross-attention with Performer complexity optimization and achieved state-of-the-art results on FFT-75. On this basis, Liu et al. [23] introduced Gaussian Bit-Flip (GBFlip) binary data augmentation and parameter-efficient fine-tuning strategies for XMP, which alleviated the inductive bias of CNNs and enhanced the cross-domain generalization ability of the model.

Mittal et al. [15] designed a modern file-type-identification tool for memory forensics and data recovery, called FiFTy. It uses a compact neural network architecture in a trainable embedding space, eliminating the need for explicit feature extraction. Evaluated on their proposed new dataset of 75 file types, FFT-75, FiFTy outperforms all baseline methods in speed, accuracy, and individual misclassification rates. Specifically, FiFTy is an order of magnitude faster than the previous tool, Sceadan [1], with an average accuracy of 77.5% and a processing speed of about 38 s per GB. Saaim et al. [24] proposed a new lightweight file fragment classification method that uses depthwise separable convolutional neural networks to improve classification accuracy and speed while reducing computational resource requirements, achieving an accuracy of 78.45% on the FFT-75 dataset. Zhu et al. [25] proposed a file fragment type identification method based on convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), achieving an average accuracy of 66.5% (512 bytes) and 78.6% (4096 bytes) on the FFT-75 dataset. Wang et al. [26] proposed a method combining image representation and a deep Inception-Attention mechanism for file type and malware classification. By converting the data sequences in memory blocks into two-dimensional binary images and using the deep Inception-Attention network to extract features and predict file types, this method showed superiority in large-scale benchmark tests. Furthermore, it can be extended to malware classification tasks, where it also performed well. Liu et al. [27] introduced a new data augmentation technique called Byte2Image for file fragment classification. This technique views file fragments in small memory blocks as two-dimensional gray-scale images and incorporates previously ignored byte-level bit information to capture inter-byte relationships. Experimental results show that this method achieves higher accuracy than existing methods on the FFT-75 dataset. Although some researchers have explored deep learning methods in recent years, there are still some issues, such as insufficient accuracy and failure to consider inter-file-type relationships.

3. Methodology

The file fragment classification problem aims to identify a function $f$ that accurately maps input data $x$, the raw data of memory blocks or file fragments, to the correct file type label $y$:

$$f : x \in \mathbb{Z}_{255}^{N_{s}} \to y \in \{\mathrm{JPG}, \mathrm{PNG}, \dots, \mathrm{DOCX}\} \tag{1}$$

where $\mathbb{Z}_{255} = \{0, 1, \dots, 255\}$ and $N_{s}$ denotes the length of the byte sequence (512 or 4096). To address this problem, we consider file type relationships and propose a multi-level classification method combining hierarchical clustering with deep learning classifiers. By constructing a hierarchical category tree structure, we incorporate category relationships into the classification model, improving accuracy and robustness for complex file fragment classification tasks.

3.1. Hierarchical Clustering Strategy

Given a training dataset $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$ with $C$ classes, each sample $x_{i}$ is a 512-byte file fragment. These fragments form the feature space for hierarchical clustering: a 512-dimensional raw byte feature space in which each dimension corresponds to the original byte value of the file fragment at a specific position, with byte values ranging from 0 to 255 (i.e., $x_{i} \in \mathbb{Z}_{255}^{512}$). We first compute the category-wise mean feature vector for each class $c$:

$$\mu_{c} = \frac{1}{N_{c}} \sum_{i : y_{i} = c} x_{i} \tag{2}$$

where $N_{c}$ is the number of samples in class $c$. These mean vectors represent class prototypes in the feature space.
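As a minimal illustration of Equation (2), the class prototypes can be computed with NumPy as sketched below. The function name `class_prototypes` and the toy 4-byte fragments are assumptions made for this example; they are not taken from the released code.

```python
import numpy as np

def class_prototypes(X, y):
    """Return (classes, M), where row M[k] is the mean fragment of classes[k] (Eq. 2)."""
    X = np.asarray(X, dtype=np.float64)  # prototypes are real-valued, so work in floating point
    y = np.asarray(y)
    classes = np.unique(y)               # sorted list of distinct class labels
    M = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, M

# Toy data (hypothetical): three 4-byte "fragments" standing in for 512-byte ones.
X = np.array([[0, 2, 4, 6], [2, 4, 6, 8], [255, 255, 0, 0]], dtype=np.uint8)
y = np.array(["TXT", "TXT", "ZIP"])
classes, M = class_prototypes(X, y)
print(classes.tolist())  # class order of the prototype rows
print(M)                 # one mean byte vector per class
```

With real data, `X` would have shape `(N, 512)` and `M` would have shape `(C, 512)`.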

We perform agglomerative hierarchical clustering on the set of prototypes $M = \{\mu_{1}, \dots, \mu_{C}\}$ in three levels. At each level, we apply Ward’s linkage criterion to minimize the within-cluster variance. The clustering yields a tree structure where the root node contains all classes, internal nodes represent intermediate groupings, and leaf nodes contain subsets of classes for final classification.

Let $T$ denote the resulting tree structure. Each leaf node $S \in T$ represents a subset of classes that will be handled by a dedicated classifier. The hierarchical decomposition allows classifiers to focus on discriminating between visually or semantically similar classes grouped together in the same branch.
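The three-level clustering can be sketched with SciPy's `linkage` and `fcluster` as follows. This is a sketch under stated assumptions: it re-clusters the prototypes inside each parent cluster at every level, which the text does not spell out, and the helper names `split` and `build_tree`, as well as the toy 2-D prototypes, are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def split(prototypes, idx, k):
    """Partition the classes in idx into at most k groups using Ward's linkage."""
    if k <= 1 or len(idx) <= 1:
        return [idx]
    Z = linkage(prototypes[idx], method="ward")
    labels = fcluster(Z, t=min(k, len(idx)), criterion="maxclust")
    return [idx[labels == c] for c in np.unique(labels)]

def build_tree(prototypes, k1, k2, k3):
    """k1: Level 1 cluster count; k2[i]: sub-clusters of Level 1 cluster i;
    k3[i][j]: leaf candidates of Level 2 sub-cluster (i, j)."""
    tree = {}
    for i, c1 in enumerate(split(prototypes, np.arange(len(prototypes)), k1)):
        for j, c2 in enumerate(split(prototypes, c1, k2[i])):
            for k, c3 in enumerate(split(prototypes, c2, k3[i][j])):
                tree[(i, j, k)] = c3.tolist()  # class indices of one leaf candidate
    return tree

# Toy prototypes (hypothetical): two far-apart groups, each with two sub-groups.
P = np.array([[0, 0], [0, 1], [0, 10], [0, 11],
              [100, 0], [100, 1], [100, 10], [100, 11]], dtype=float)
tree = build_tree(P, k1=2, k2=(2, 2), k3=((1, 1), (1, 1)))
print(tree)
```

For the configuration reported in Section 4.1.2, the call would presumably take `k1=2`, `k2=(2, 3)`, and `k3=((2, 1), (4, 1, 2))` on the 75 prototypes.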

3.2. Leaf Node Optimization

To ensure effective training, we enforce constraints on the cardinality of leaf nodes. Empirically, we found that nodes with too few classes (<4) lead to overfitting, while nodes with too many classes (>15) make the classification task overly difficult.

For any leaf node $S$ with $|S| < \tau_{\min}$ (where $\tau_{\min} = 4$), we merge it with its sibling node $S'$, subject to the constraint that the merged node does not exceed the maximum size $\tau_{\max} = 15$:

$$S' \leftarrow S' \cup S, \quad \text{if } |S| < \tau_{\min} \text{ and } |S' \cup S| \leq \tau_{\max} \tag{3}$$

Conversely, for nodes with $|S| > \tau_{\max}$, we split the node into two balanced subsets by bipartitioning the class set into two halves of approximately equal size. This optimization keeps leaf nodes between 4 and 15 classes wherever a valid merge exists, striking a balance between specialization and sufficient training data per class. If no suitable sibling node is available for merging (i.e., all siblings would exceed the maximum size limit after merging), a new leaf node is created to accommodate the categories that would otherwise be left without a valid assignment; such a fallback leaf may contain fewer than $\tau_{\min}$ classes. The threshold settings were primarily determined empirically based on the distribution of the 75 file types in the FFT-75 dataset.

The resulting hierarchical category tree structure with dynamic adjustment is illustrated in Figure 1, which fully describes the hierarchical relationships among categories while maintaining a balanced distribution of categories at the leaf nodes.

The overall hierarchical clustering procedure with dynamic adjustment is summarized in Algorithm 1.

Algorithm 1 Hierarchical clustering with dynamic adjustment.

Require: Category mean vectors $M = \{\mu_{c}\}_{c = 1}^{C}$; number of clusters at each level
Ensure: Hierarchical tree structure $T$
1: Perform Level 1 clustering with Ward’s linkage to obtain $K_{1}$ clusters
2: for each Level 1 cluster $i$ do
3:     Perform Level 2 clustering to obtain $K_{2}^{(i)}$ sub-clusters
4:     for each Level 2 sub-cluster $j$ do
5:         Perform Level 3 clustering to obtain $K_{3}^{(i, j)}$ leaf candidates
6:         for each Level 3 candidate $k$ with label set $S$ do
7:             if $|S| < \tau_{\min}$ then
8:                 Merge $S$ into sibling node using Equation (3)
9:             else if $|S| > \tau_{\max}$ then
10:                Split $S$ into two balanced sub-nodes
11:            else
12:                Assign $S$ as final leaf node
13:            end if
14:        end for
15:    end for
16: end for
17: return Optimized tree structure $T$

This clear organizational framework enhances the training and optimization of classifiers, improving classification performance.
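The merge and split rules of Section 3.2 can be sketched in plain Python as shown below. Several details are assumptions made for illustration only: the function name `adjust_leaves`, the choice to process splits before merges, and the choice to merge an undersized node into the first sibling with enough room. The text does not say which sibling is selected.

```python
def adjust_leaves(candidates, tau_min=4, tau_max=15):
    """Apply the split and merge rules to sibling leaf candidates (lists of class labels)."""
    leaves, small = [], []
    for s in candidates:
        s = list(s)  # copy so the caller's lists are not modified
        if len(s) > tau_max:
            half = (len(s) + 1) // 2
            leaves.extend([s[:half], s[half:]])  # balanced bipartition
        elif len(s) < tau_min:
            small.append(s)                      # handled after all regular leaves are known
        else:
            leaves.append(s)
    for s in small:
        # Eq. (3): merge only if the merged node stays within tau_max.
        target = next((leaf for leaf in leaves if len(leaf) + len(s) <= tau_max), None)
        if target is not None:
            target.extend(s)
        else:
            leaves.append(s)                     # fallback: keep as a new leaf node
    return leaves

# Hypothetical sibling candidates with 2, 5, and 16 classes.
siblings = [["GPR", "PEF"], [f"A{i}" for i in range(5)], [f"B{i}" for i in range(16)]]
sizes = [len(leaf) for leaf in adjust_leaves(siblings)]

# No sibling has room, so the undersized node survives as its own leaf.
crowded = [[f"C{i}" for i in range(15)], ["GPR", "PEF"]]
crowded_sizes = [len(leaf) for leaf in adjust_leaves(crowded)]
print(sizes, crowded_sizes)
```

The second call illustrates the fallback case, in which a leaf smaller than the minimum threshold remains because no merge satisfies the size constraint.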

3.3. Neural Network Classifiers

For each leaf node $S$ containing $M$ classes, we train a convolutional neural network classifier $f_{S} : \mathbb{R}^{N_{s}} \to \mathbb{R}^{M}$ that maps an input byte sequence to class probabilities.

Embedding layer: The discrete input sequence (byte values ranging from 0 to 255) is first transformed into dense vector representations through an embedding lookup operation. The embedding layer maps 256 possible byte values (vocabulary size) to $d_{e} = 64$-dimensional vectors, producing a matrix $E \in \mathbb{R}^{N_{s} \times d_{e}}$.

Convolutional feature extraction: The network employs multiple one-dimensional convolutional layers to extract hierarchical features. Each convolutional layer applies a set of learnable filters to the input feature maps, followed by a non-linear activation function. We employ the Leaky Rectified Linear Unit (LeakyReLU) as the activation function:

$$\phi(z) = \max(z, \alpha z) = \begin{cases} z, & z > 0 \\ \alpha z, & z \leq 0 \end{cases} \tag{4}$$

where $\alpha = 0.3$ is a hyperparameter controlling the negative slope. This activation mitigates the dying ReLU problem by keeping a small, non-zero gradient for negative inputs.

Following each convolutional layer (except the last), we apply max-pooling with pooling size 2 to reduce the spatial dimensions and increase the receptive field. The pooling operation computes the maximum value within each local neighborhood.

Global aggregation: After the final convolutional block, we apply global average pooling to aggregate spatial information across the entire sequence, producing a fixed-dimensional feature vector $g \in \mathbb{R}^{F}$ regardless of the input length.

Classification head: The pooled features pass through a fully-connected layer with dropout regularization (dropout rate $p \in \{0.1, 0.2\}$), followed by the final linear transformation and softmax activation to obtain class probabilities:

$$\hat{y} = \mathrm{softmax}(W \cdot \mathrm{dropout}(g) + b) \tag{5}$$

where $W \in \mathbb{R}^{M \times F}$ and $b \in \mathbb{R}^{M}$ are learnable parameters.

The network is trained by minimizing the categorical cross-entropy loss between the predicted probabilities and the ground truth one-hot encoded labels.

3.4. Architecture Variants

We employ two architecture variants tailored to the complexity of different leaf nodes:

P1 (deep architecture): This is used for complex leaf nodes requiring fine-grained discrimination (e.g., leaf nodes containing diverse multimedia formats). This variant comprises 4 convolutional layers with $(32, 64, 128, 128)$ filters; consistent with Section 3.3, each of the first three is followed by max-pooling, and the last is connected directly to global average pooling. The deeper architecture captures more complex patterns at the cost of increased computational requirements. The dropout rate is set to 0.2. The architecture is depicted in Figure 2.

P2 (shallow architecture): This is used for simpler leaf nodes. This variant uses 2 convolutional layers with 32 filters each. The first convolutional layer is followed by max-pooling, while the second layer directly connects to the global average pooling layer, reducing computational complexity for simpler classification tasks. Dropout rate is set to 0.1. The architecture is depicted in Figure 3.
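To make the data flow of Sections 3.3 and 3.4 concrete, the following NumPy sketch runs an inference-time forward pass through both variants. It is an illustration, not the released implementation: the kernel size of 3, the unpadded convolution, the random weights, and all function names are assumptions, since the text does not specify them. Pooling follows the Section 3.3 rule (after every convolutional layer except the last), and dropout is omitted because it acts as the identity at inference.

```python
import numpy as np

FILTERS = {"P1": (32, 64, 128, 128), "P2": (32, 32)}  # filter counts from Section 3.4

def leaky_relu(z, alpha=0.3):
    return np.maximum(z, alpha * z)                    # Eq. (4)

def conv1d(x, w, b):
    """Unpadded 1-D convolution. x: (length, c_in), w: (k, c_in, c_out), b: (c_out,)."""
    k = w.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])  # (L-k+1, k, c_in)
    return np.einsum("lkc,kco->lo", windows, w) + b

def max_pool1d(x, size=2):
    length = (x.shape[0] // size) * size               # drop a trailing odd step
    return x[:length].reshape(-1, size, x.shape[1]).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(rng, filters, n_classes, d_e=64, kernel=3):
    """Random parameters; kernel=3 is an assumed value."""
    params = {"embed": rng.normal(0, 0.1, (256, d_e)), "convs": []}
    c_in = d_e
    for c_out in filters:
        params["convs"].append((rng.normal(0, 0.1, (kernel, c_in, c_out)), np.zeros(c_out)))
        c_in = c_out
    params["W"] = rng.normal(0, 0.1, (n_classes, c_in))
    params["b"] = np.zeros(n_classes)
    return params

def forward(fragment, params):
    h = params["embed"][fragment]                      # embedding lookup: (N_s, d_e)
    last = len(params["convs"]) - 1
    for i, (w, b) in enumerate(params["convs"]):
        h = leaky_relu(conv1d(h, w, b))
        if i < last:
            h = max_pool1d(h)                          # pool after all but the last conv
    g = h.mean(axis=0)                                 # global average pooling: (F,)
    return softmax(params["W"] @ g + params["b"])      # Eq. (5) without dropout

rng = np.random.default_rng(0)
fragment = rng.integers(0, 256, size=512)              # one synthetic 512-byte block
probs = {name: forward(fragment, init_params(rng, f, n_classes=6))
         for name, f in FILTERS.items()}
print({name: p.shape for name, p in probs.items()})
```

Because global average pooling removes the length dimension, the same parameters also accept 4096-byte inputs without modification.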

The selection of network architectures is determined by an empirical evaluation of the classification task complexity for each leaf node, which is systematically conducted based on two core criteria. First is the feature complexity of the categories within each leaf node: P1 is adopted if a leaf node contains high-entropy file types (e.g., compressed files, multimedia files, installation packages) and files with complex binary structures (e.g., compound document formats, embedded multi-stream media files); in contrast, P2 is used for leaf nodes consisting of low-entropy file types with regular binary structures (e.g., plain text files, raw image files, log files). Second is the baseline classification accuracy: a shallow P2-like baseline model is first trained for preliminary classification on each leaf node; P1 is then employed to enhance the model’s feature extraction capability if the average baseline classification accuracy of the leaf node is below 60%, while P2 is sufficient to achieve accurate classification when the baseline accuracy exceeds 60%. Specifically, the leaf nodes Level_3_1_0_1 and Level_3_1_0_2 contain a variety of multimedia formats characterized by high entropy and complex binary structures, and their classification accuracy on the shallow P2-like baseline model is below 60%, thus satisfying both criteria for P1 adoption and being configured with the P1 classifier. All other leaf nodes comprise low-entropy categories with regular binary structures and yield a classification accuracy above 60% on the P2-like baseline model, which meets the requirements for P2 application, and are therefore uniformly equipped with the P2 classifier. This heuristic selection strategy not only ensures that complex classification tasks are supported by adequate model capacity but also preserves the computational efficiency of simple classification tasks.
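The two selection criteria can be written as a small decision function. This is a sketch only: the function name and arguments are hypothetical, and treating either criterion as sufficient for P1 is an assumption, since every leaf node reported above satisfies either both criteria or neither.

```python
def select_architecture(contains_high_entropy_types: bool,
                        baseline_accuracy: float,
                        threshold: float = 0.60) -> str:
    """Pick 'P1' (deep) or 'P2' (shallow) for one leaf node."""
    # Assumption: one triggered criterion is enough to choose the deeper model.
    if contains_high_entropy_types or baseline_accuracy < threshold:
        return "P1"
    return "P2"

# Hypothetical baseline accuracies, used only to exercise the function.
hard_node = select_architecture(True, 0.45)
easy_node = select_architecture(False, 0.85)
print(hard_node, easy_node)
```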

The classifier system, designed based on the characteristics of each leaf node, ensures that the classifier for each node can efficiently handle its corresponding category, thereby improving the overall classification accuracy of the hierarchical structure.

4. Experimental Evaluation

4.1. Experimental Setup

4.1.1. Dataset

We evaluated our method on the large-scale dataset FFT-75 [28], the largest corpus to date for file fragment type classification. It comprises six scenarios, considering both 512-byte and 4096-byte file system clusters. Scenario 1 contains 75 file types, such as JPEG, PDF, and ZIP, with a total of 7.5 million samples. For our experiments, we utilized the 512-byte clusters from Scenario 1.

4.1.2. Implementation Details

The hierarchical clustering algorithm initializes the number of clusters as follows: the first layer (Level 1) consists of 2 clusters, the second layer (Level 2) divides each cluster from the first layer into 2 or 3 sub-clusters, and the third layer (Level 3) further subdivides the sub-clusters from the second layer based on the structure $[[2, 1], [4, 1, 2]]$. The thresholds for merging and splitting leaf nodes are set to $\tau_{\min} = 4$ and $\tau_{\max} = 15$, respectively.

Both the P1 and P2 classifiers use an embedding size of 64, with dropout rates of 0.2 and 0.1, respectively. The optimizer is Adam, and the loss function is categorical cross-entropy. Each model is trained for 30 epochs.

We evaluate our model using per-class accuracy and average accuracy, calculated as follows:

$$\mathrm{Accuracy}_{i} = \frac{a_{i}}{b_{i}}, \qquad \mathrm{Avg\_Accuracy} = \frac{1}{n} \sum_{i = 1}^{n} \frac{a_{i}}{b_{i}} \tag{6}$$

where $a_{i}$ denotes the number of correctly predicted samples for class $i$, $b_{i}$ is the total number of samples for class $i$, and $n$ is the total number of classes.
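Equation (6) averages the per-class accuracies, so every class contributes equally regardless of its sample count. A minimal sketch using only the standard library follows; the function name and the toy labels are assumptions for illustration.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Return ({class: a_i / b_i}, average over classes) as in Eq. (6)."""
    total = Counter(y_true)                                          # b_i
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # a_i
    acc = {c: correct[c] / total[c] for c in total}
    return acc, sum(acc.values()) / len(acc)

# Toy labels (hypothetical), not results from the paper.
y_true = ["JPG", "JPG", "PNG", "PNG", "PNG", "ZIP"]
y_pred = ["JPG", "PNG", "PNG", "PNG", "ZIP", "ZIP"]
acc, avg = per_class_accuracy(y_true, y_pred)
print(acc, avg)
```

In this toy case, four of six samples are correct (about 0.667 overall), while the class-averaged value is (1/2 + 2/3 + 1)/3, about 0.722, which shows how the two measures can differ.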

Our experimental environment is as follows: CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz; GPU: Tesla P40 24 GB; RAM: 106 GB; OS: Ubuntu 16.04.

4.2. Experimental Results

The clustering results shown in Figure 1 accurately reflect feature similarities and potential associations between file types, demonstrating the effectiveness of the clustering algorithm for file classification. Key observations include:

Level_3_0_0_0 (ARW, RW2, 3FR, DLL, WMA, PCAP, DWG): Groups raw image files with dynamic link libraries, audio files, and network packet files, possibly due to consistency in binary structure or data header features.
Level_3_0_0_1 (EPS, MACH-O, ELF, DOC, MD, RTF, TXT, TEX, JSON, HTML, XML, LOG, CSV, SQLITE): Accurately groups text files, executable files, and code files, demonstrating shared storage structures and text markup features.
Level_3_0_1_0 (NRW, RAF, XLS, TTF): Groups raw image formats with spreadsheet and font files, possibly due to similarities in file metadata or format markers.
Level_3_1_0_0 (CR2, ORF, MOV, 3GP, WEBM, JAR, MOBI, PDF, AIFF, FLAC, M4A, WAV, BMP, KEY, PPTX): Successfully clusters multimedia files sharing specific encoding and packaging formats.
Level_3_1_0_1 (JPG, DNG, TIFF, HEIC, PNG, MP4, AVI, MKV, OGV, APK, MSI, DMG, 7Z, MP3): Includes mainstream multimedia files, installation packages, and compressed files.
Level_3_1_0_2 (BZ2, DEB, GZ, PKG, RAR, RPM, XZ, ZIP, EXE, DOCX, XLSX, DJVU, EPUB): Groups compressed files and advanced document formats.
Level_3_1_0_3 (NEF, GIF, AI, PSD, PPT, OGG): Groups design files, image files, and audio files.
Merged_Leaf (GPR, PEF): Groups raw image formats with unique storage characteristics.

The clustering results are highly consistent with the inherent properties of the file types, clearly displaying the associations and feature similarities among categories. This also verifies the rationality of the proposed hierarchical clustering process: the three-level agglomerative clustering based on the 512-dimensional raw byte feature space and Ward’s linkage criterion can accurately capture the feature similarity between different file types, and the dynamic adjustment mechanism of leaf nodes (merge/split) further optimizes the category distribution of each leaf node. The resulting hierarchical structure not only reflects the intrinsic relationships between file types but also provides a reasonable grouping basis for the subsequent dedicated classifiers, which is the key to reducing the complexity of the classification task and improving the overall classification accuracy.

Table 1 presents evaluation results of our model on the FFT-75 dataset. Overall, some file types achieved very high classification accuracy, such as ARW, GPR, XLS, JSON, and XML, with 100% accuracy. Moreover, categories like JPG, TIFF, MKV, ELF, and DLL also exhibited accuracy rates exceeding 90%, demonstrating excellent performance.

However, significant classification challenges persist in specific file types:

HEIC (20%): As a high-efficiency image container, its high entropy characteristics (from advanced compression) and fragmented data distribution lead to misclassification, mainly as 7Z (27.7%), followed by MP4 (19.1%) and AVI (11.3%).
MP4 (21%): Combines compound structure (embedded video/audio/metadata) with high entropy (e.g., H.265 encoding), causing misclassification, mainly as 7Z (28.4%), followed by AVI (17.2%) and DMG (3.8%).
DOCX (21%): Compound document format with embedded multimedia elements (images, formatting tags), resulting in cross-type misclassification, mainly as EPUB (16.2%), followed by BZ2 (10.9%) and GZ (10.6%).
7Z (21%)/BZ2 (19%)/XZ (16%): Exhibit extremely high entropy (redundancy elimination via deep compression), resulting in severe misclassification; specifically, XZ is mainly misclassified as BZ2 (23.1%), followed by DJVU (10.0%) and DOCX (8.5%), while BZ2 is mainly misclassified as XZ (24.9%), followed by DOCX (9.1%) and DEB (5.7%).
EXE (18%): Windows executables’ complex structure (code/resources/strings) and high entropy drive misclassification, mainly as XZ (27.2%), followed by BZ2 (18.4%) and DJVU (9.2%).

These categories showed poor classification performance, which may require further optimization of the model or feature extraction methods.
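The "high entropy" attributed to these formats can be made concrete with the Shannon entropy of a block's byte histogram, measured in bits per byte. The sketch below is illustrative only; the function name is an assumption, and the two inputs are synthetic blocks rather than FFT-75 samples.

```python
import math
from collections import Counter

def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte histogram, between 0 and 8 bits per byte."""
    n = len(fragment)
    if n == 0:
        return 0.0
    probs = [count / n for count in Counter(fragment).values()]
    return 0.0 - sum(p * math.log2(p) for p in probs)

constant_block = bytes(512)            # 512 zero bytes: a single symbol
uniform_block = bytes(range(256)) * 2  # every byte value appears exactly twice
low = byte_entropy(constant_block)
high = byte_entropy(uniform_block)
print(low, high)
```

Blocks from well-compressed formats tend to sit near the upper end of this range, which leaves little byte-level structure to tell one such format from another.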

Table 2 compares our model with other methods on the FFT-75 dataset. We evaluated our method against state-of-the-art approaches (e.g., FiFTy [15], DS-CNN [24], and Byte2Image [27]) on FFT-75 Scenario #1 (512 bytes). All baseline methods in Table 2 were tested under exactly the same experimental settings as in our study. The baseline results cited are the values officially reported in the original studies; we also reproduced all baseline methods in the same experimental environment, and the reproduced accuracy differed from the originally reported values by less than 1%. Our model achieved an average accuracy of 76.3%, which is 10.7 percentage points higher than FiFTy, 10.4 percentage points higher than DS-CNN, and 5.3 percentage points higher than Byte2Image.

5. Conclusions

In this paper, we proposed a hierarchical classification structure for file fragment classification that combines agglomerative hierarchical clustering with deep learning. A dynamic adjustment mechanism keeps the number of categories at each leaf node within a reasonable range, which improves the stability and scalability of the hierarchy. We also built a specialized classifier for each leaf node, selecting the classification model according to the characteristics of the node. On the FFT-75 dataset (scenario #1, 512-byte blocks), our method achieved 76.3% accuracy, the highest among the methods compared in Table 2, and alleviated the misclassification caused by the large number of classification types.

The results on FFT-75 also reveal clear performance differences among file types (Table 1). These differences are closely related to the intrinsic characteristics of each file type and to how those characteristics are expressed in the 512-dimensional raw byte feature space, which further reflects the practical adaptability of the proposed hierarchical method. For file types with highly structured binary characteristics (e.g., ARW, GPR, JSON, XML), fixed header/footer features and regular byte distributions make the feature vectors clearly distinguishable in the clustering feature space, and the classification accuracy reaches 100%. For file types with semi-structured binary characteristics (e.g., DLL, MP3, GIF), some features overlap with those of similar types, but the hierarchical clustering groups them with those similar types into a leaf node of reasonable size, where the dedicated classifier can learn the fine-grained differences; the classification accuracy therefore remains at 90% or above. For file types with high entropy and complex composite structure (e.g., XZ, EXE, MP4, HEIC), three factors make the raw byte features very hard to distinguish: the random byte distribution caused by deep compression, the large intra-class variation caused by embedded multi-component content, and the serious feature overlap with similar formats. Even with the P1 deep classifier on the corresponding leaf nodes, fine-grained classification remains challenging, and the accuracy for these types is relatively low (16% to 21% in Table 1).

Despite these achievements, our work still has certain limitations that need to be addressed:

1. The dynamic adjustment mechanism has limited adaptability. It currently meets the needs of conventional classification scenarios only, and its performance in complex and dynamic classification scenarios needs improvement.
2. The leaf node classifiers can be better targeted and more efficient in architecture. The existing models do not fully adapt to the characteristics of the data in each category, and efficient training strategies have not been explored sufficiently.

In future work, we will further optimize the dynamic adjustment mechanism in the clustering process to better adapt to complex and dynamic classification scenarios and needs. For each leaf node, we will design more targeted classifiers and explore more efficient classifier architectures and training strategies, such as introducing more advanced deep learning models or combining multimodal features, to further improve classification performance. We also plan to validate our method on more diverse datasets to evaluate its generalizability and robustness and study how to intelligently handle class distribution imbalance. These improvements will contribute to advancing file fragment classification technology and provide stronger support for the digital forensics field.

Author Contributions

Conceptualization, B.Z. and H.L.; methodology, B.Z.; software, B.Z.; validation, B.Z.; formal analysis, B.Z. and H.L.; investigation, B.Z.; resources, B.Z.; data curation, B.Z.; writing—original draft preparation, B.Z. and H.L.; writing—review and editing, B.Z. and H.L.; visualization, B.Z.; supervision, H.L.; project administration, B.Z.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by “Cyberspace Security Discipline” and the APC was funded by “Nanjing Police University”.

Data Availability Statement

The datasets analyzed during the current study are available in IEEE Dataport, http://dx.doi.org/10.21227/kfxw-8084 (accessed on 29 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Beebe, N.L.; Maddox, L.A.; Liu, L.; Sun, M. Sceadan: Using concatenated n-gram vectors for improved file and data type classification. IEEE Trans. Inf. Forensics Secur. 2013, 8, 1519–1530.
2. Wang, F.; Quach, T.T.; Wheeler, J.; Aimone, J.B.; James, C.D. Sparse coding for n-gram feature extraction and training for file fragment classification. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2553–2562.
3. Veenman, C.J. Statistical disk cluster classification for file carving. In Third International Symposium on Information Assurance and Security; IEEE: New York, NY, USA, 2007.
4. Sportiello, L.; Zanero, S. File block classification by support vector machine. In Sixth International Conference on Availability; IEEE: New York, NY, USA, 2011.
5. Fitzgerald, S.; Mathews, G.; Morris, C.; Zhulyn, O. Using NLP techniques for file fragment classification. Digit. Investig. 2012, 9, S44–S49.
6. Li, Q.; Ong, A.; Suganthan, P.; Thing, V. A novel support vector machine approach to high entropy data fragment classification. In Proceedings of the SAISMC 2010, Port Elizabeth, South Africa, 17–18 May 2010.
7. Bhat, K.; Lam, J.T.; Zulkernine, F. Content-based file type identification. In 2018 10th International Conference on Electrical and Computer Engineering (ICECE); IEEE: New York, NY, USA, 2018; pp. 277–280.
8. Ahmed, I.; Lhee, K.S.; Shin, H.J.; Hong, M.P. Fast content-based file type identification. In IFIP International Conference on Digital Forensics; Springer: Berlin/Heidelberg, Germany, 2011.
9. Hanis, F.M.; Khoshvaghti, H.; Teimouri, M.; Veisi, H. A language-independent approach to classification of textual file fragments: Case study of Persian, English, and Chinese languages. In 2021 11th International Conference on Computer Engineering and Knowledge (ICCKE); IEEE: New York, NY, USA, 2021; pp. 254–259.
10. Amirani, M.C.; Toorani, M.; Beheshti, A. A new approach to content-based file type detection. In 2008 IEEE Symposium on Computers and Communications; IEEE: New York, NY, USA, 2008; pp. 1103–1108.
11. Ahmed, I.; Lhee, K.S.; Shin, H.; Hong, M. Content-based file-type identification using cosine similarity and a divide-and-conquer approach. IETE Tech. Rev. 2010, 27, 465–477.
12. Sitompul, O.S.; Rahmat, R.F. Distributed autonomous Neuro-Gen learning engine for content-based document file type identification. In 2014 International Conference on Cyber and IT Service Management (CITSM); IEEE: New York, NY, USA, 2015.
13. Karampidis, K.; Papadourakis, G. File type identification-Computational intelligence for digital forensics. J. Digit. Forensics Secur. Law 2017, 12, 6.
14. Chen, Q.; Liao, Q.; Jiang, Z.L.; Fang, J.; Yiu, S.; Xi, G. File fragment classification using grayscale image conversion and deep learning in digital forensics. In 2018 IEEE Security and Privacy Workshops (SPW); IEEE: New York, NY, USA, 2018; pp. 140–147.
15. Mittal, G.; Korus, P.; Memon, N. FiFTy: Large-scale file fragment type identification using convolutional neural networks. IEEE Trans. Inf. Forensics Secur. 2020, 16, 28–41.
16. Karres, M.; Shahmehri, N. File type identification of data fragments by their binary structure. In 2006 IEEE Information Assurance Workshop; IEEE: New York, NY, USA, 2006.
17. Calhoun, W.C.; Coles, D. Predicting the types of file fragments. Digit. Investig. 2008, 5, S14–S20.
18. Masoumi, M.; Keshavarz, A.; Fotohi, R. File fragment recognition based on content and statistical features. Multimed. Tools Appl. 2021, 80, 18859–18874.
19. Bhatt, M.; Mishra, A.; Kabir, M.W.U.; Blake-Gatto, S.E.; Rajendra, R.; Hoque, M.T.; Ahmed, I. Hierarchy-based file fragment classification. Mach. Learn. Knowl. Extr. 2020, 2, 216–232.
20. Wang, Y.; Liu, W.Y.; Wu, K.J.; Yap, K.H.; Chau, L.P. Intra- and inter-sector contextual information fusion with joint self-attention for file fragment classification. Knowl.-Based Syst. 2024, 291, 111565.
21. Alam, S.; Altiparmak, Z. Optimizing file fragment classification by mitigating class imbalance problem. In 2024 1st International Conference on Innovative Engineering Sciences and Technological Research (ICIESTR); IEEE: New York, NY, USA, 2024; pp. 1–6.
22. Park, J.G.; Liu, S.; Hong, J.H. XMP: A cross-attention multi-scale performer for file fragment classification. In 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2024; pp. 4505–4509.
23. Liu, S.; Park, J.G.; Kim, H.S.; Hong, J.H. A cross-attention multi-scale performer with Gaussian bit-flips for file fragment classification. IEEE Trans. Inf. Forensics Secur. 2025, 20, 2109–2121.
24. Saaim, K.M.; Felemban, M.; Alsaleh, S.; Almulhem, A. Light-weight file fragments classification using depthwise separable convolutions. In IFIP International Conference on ICT Systems Security and Privacy Protection; Springer International Publishing: Berlin/Heidelberg, Germany, 2022; pp. 196–211.
25. Zhu, N.; Liu, Y.; Wang, K.; Ma, C. File fragment type identification based on CNN and LSTM. In Proceedings of the 2023 7th International Conference on Digital Signal Processing; Association for Computing Machinery: New York, NY, USA, 2023; pp. 16–22.
26. Wang, Y.; Wu, K.; Liu, W.; Yap, K.H.; Chau, L.P. Image representation and deep inception-attention for file-type and malware classification. In 2023 IEEE International Symposium on Circuits and Systems (ISCAS); IEEE: New York, NY, USA, 2023; pp. 1–5.
27. Liu, W.; Wang, Y.; Wu, K.; Yap, K.H.; Chau, L.P. A byte sequence is worth an image: CNN for file fragment classification using bit shift and n-gram embeddings. In 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS); IEEE: New York, NY, USA, 2023; pp. 1–5.
28. Mittal, G.; Korus, P.; Memon, N. File Fragment Type (FFT)-75 Dataset [EB/OL]. 2019. Available online: https://ieee-dataport.org/open-access/file-fragment-type-fft-75-dataset (accessed on 29 March 2026).

Figure 1. Hierarchical category tree structure with dynamic adjustment mechanism. The tree consists of three levels: Level 1 divides all categories into 2 broad clusters; Level 2 further subdivides into 2–3 sub-clusters; Level 3 produces leaf nodes with 4–15 categories each. Nodes with insufficient categories (<4) are merged, while oversized nodes (>15) are split.
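The merge and split rule in the caption can be illustrated with a minimal sketch. Only the two thresholds (fewer than 4, more than 15) come from the caption; the function name `adjust_leaves`, the nearest-neighbour merge target, the split-in-half rule, and the toy distance `centroid_gap` are assumptions made for illustration and are not necessarily the procedure used in this work.

```python
from typing import Callable, List

MIN_CLASSES = 4   # Figure 1: leaf nodes with fewer than 4 categories are merged
MAX_CLASSES = 15  # Figure 1: leaf nodes with more than 15 categories are split

Leaf = List[str]


def adjust_leaves(leaves: List[Leaf],
                  distance: Callable[[Leaf, Leaf], float]) -> List[Leaf]:
    """Merge undersized leaves and split oversized ones until every size fits.

    ASSUMPTION: an undersized leaf is merged into its nearest neighbour under
    `distance`, and an oversized leaf is cut in half by position.
    """
    leaves = [list(leaf) for leaf in leaves]  # work on a copy
    changed = True
    while changed:
        changed = False
        for i, leaf in enumerate(leaves):
            if len(leaf) < MIN_CLASSES and len(leaves) > 1:
                others = [j for j in range(len(leaves)) if j != i]
                nearest = min(others, key=lambda j: distance(leaf, leaves[j]))
                leaves[nearest] = leaves[nearest] + leaf
                del leaves[i]
                changed = True
                break  # indices changed, rescan from the start
            if len(leaf) > MAX_CLASSES:
                mid = len(leaf) // 2
                leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]
                changed = True
                break
    return leaves


# Toy data: 25 hypothetical class names with a one-dimensional stand-in feature.
classes = [f"type{i:02d}" for i in range(25)]
toy_feature = {name: float(i) for i, name in enumerate(classes)}


def centroid_gap(a: Leaf, b: Leaf) -> float:
    """Hypothetical distance: gap between the mean toy features of two leaves."""
    centre = lambda leaf: sum(toy_feature[c] for c in leaf) / len(leaf)
    return abs(centre(a) - centre(b))


initial = [classes[:2], classes[2:7], classes[7:]]  # leaf sizes 2, 5, 18
adjusted = adjust_leaves(initial, centroid_gap)
print([len(leaf) for leaf in adjusted])  # [7, 9, 9]
```

In this toy run the 2-class leaf is absorbed by its nearest neighbour and the 18-class leaf is halved, so every leaf ends up within the 4 to 15 range while all 25 classes are preserved.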

Figure 2. Architecture of the P1 classifier (deep variant). The network consists of an embedding layer (64-dim), four convolutional blocks (Conv1D + LeakyReLU + MaxPool), global average pooling, dropout (rate 0.2), and fully-connected layers with softmax output.

Figure 3. Architecture of the P2 classifier (shallow variant). The network consists of an embedding layer (64-dim), two convolutional blocks (Conv1D + LeakyReLU + MaxPool), global average pooling, dropout (rate 0.1), and fully-connected layers with softmax output.
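To make the difference between the two variants concrete, the sketch below records the settings stated in the captions of Figures 2 and 3 and traces how the sequence length of a 512-byte fragment would shrink through the convolutional blocks. The pooling size of 2 and the length-preserving ("same"-padded) convolutions are assumptions, as are the names `Variant` and `trace_lengths`; the captions do not state kernel or pooling sizes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    """Settings taken from the captions of Figures 2 and 3."""
    name: str
    conv_blocks: int         # number of Conv1D + LeakyReLU + MaxPool blocks
    dropout: float           # dropout rate before the fully-connected layers
    embedding_dim: int = 64  # both variants use a 64-dim embedding layer


P1 = Variant("P1 (deep)", conv_blocks=4, dropout=0.2)
P2 = Variant("P2 (shallow)", conv_blocks=2, dropout=0.1)


def trace_lengths(variant: Variant, fragment_len: int = 512,
                  pool_size: int = 2) -> List[int]:
    """Sequence length after the embedding and after each conv block.

    ASSUMPTION: convolutions preserve length and each max-pooling layer
    divides the length by `pool_size` (not stated in the captions).
    """
    lengths = [fragment_len]
    for _ in range(variant.conv_blocks):
        lengths.append(lengths[-1] // pool_size)
    return lengths


p1_lengths = trace_lengths(P1)
p2_lengths = trace_lengths(P2)
print(P1.name, p1_lengths)  # P1 (deep) [512, 256, 128, 64, 32]
print(P2.name, p2_lengths)  # P2 (shallow) [512, 256, 128]
```

Under these assumptions, global average pooling in P1 summarizes 32 positions and in P2 summarizes 128 positions, which is one way to read the deep and shallow labels.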

Table 1. Evaluation results on FFT-75 dataset (scenario #1, 512 bytes).

No.	Classes	Acc. (%)	No.	Classes	Acc. (%)	No.	Classes	Acc. (%)
1	arw	100	26	nef	98	51	pptx	69
2	gpr	100	27	mkv	98	52	apk	68
3	nrw	100	28	doc	98	53	key	68
4	pef	100	29	m4a	98	54	pdf	65
5	raf	100	30	mobi	97	55	jar	63
6	3fr	100	31	txt	97	56	djvu	58
7	eps	100	32	jpg	96	57	flac	58
8	xls	100	33	tiff	96	58	rpm	48
9	json	100	34	psd	96	59	avi	46
10	xml	100	35	mach-o	96	60	dmg	44
11	log	100	36	ogg	96	61	zip	40
12	csv	100	37	elf	94	62	gz	36
13	aiff	100	38	pcap	94	63	mov	34
14	wav	100	39	xlsx	93	64	msi	34
15	wma	100	40	orf	92	65	epub	29
16	ttf	100	41	cr2	91	66	deb	26
17	dwg	100	42	gif	90	67	pkg	25
18	rw2	99	43	3gp	90	68	rar	23
19	dll	99	44	bmp	85	69	mp4	21
20	md	99	45	ppt	84	70	7z	21
21	rtf	99	46	ogv	82	71	docx	21
22	tex	99	47	dng	81	72	heic	20
23	html	99	48	ai	78	73	bz2	19
24	mp3	99	49	webm	71	74	exe	18
25	sqlite	99	50	png	70	75	xz	16
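As a consistency check, the per-class values in Table 1 can be averaged and compared with the overall accuracy reported in Table 2. The short script below transcribes Table 1 and computes the unweighted mean; this equals the overall accuracy only if every class contributes the same number of test fragments, which is assumed here rather than stated in this section.

```python
# Per-class accuracies (%) transcribed from Table 1, in table order.
table1 = {
    "arw": 100, "gpr": 100, "nrw": 100, "pef": 100, "raf": 100,
    "3fr": 100, "eps": 100, "xls": 100, "json": 100, "xml": 100,
    "log": 100, "csv": 100, "aiff": 100, "wav": 100, "wma": 100,
    "ttf": 100, "dwg": 100, "rw2": 99, "dll": 99, "md": 99,
    "rtf": 99, "tex": 99, "html": 99, "mp3": 99, "sqlite": 99,
    "nef": 98, "mkv": 98, "doc": 98, "m4a": 98, "mobi": 97,
    "txt": 97, "jpg": 96, "tiff": 96, "psd": 96, "mach-o": 96,
    "ogg": 96, "elf": 94, "pcap": 94, "xlsx": 93, "orf": 92,
    "cr2": 91, "gif": 90, "3gp": 90, "bmp": 85, "ppt": 84,
    "ogv": 82, "dng": 81, "ai": 78, "webm": 71, "png": 70,
    "pptx": 69, "apk": 68, "key": 68, "pdf": 65, "jar": 63,
    "djvu": 58, "flac": 58, "rpm": 48, "avi": 46, "dmg": 44,
    "zip": 40, "gz": 36, "mov": 34, "msi": 34, "epub": 29,
    "deb": 26, "pkg": 25, "rar": 23, "mp4": 21, "7z": 21,
    "docx": 21, "heic": 20, "bz2": 19, "exe": 18, "xz": 16,
}

# Unweighted mean over classes (ASSUMPTION: balanced test set).
macro_accuracy = sum(table1.values()) / len(table1)

perfect = [c for c, acc in table1.items() if acc == 100]
at_least_90 = [c for c, acc in table1.items() if acc >= 90]
below_50 = [c for c, acc in table1.items() if acc < 50]

print(len(table1), round(macro_accuracy, 1))          # 75 76.3
print(len(perfect), len(at_least_90), len(below_50))  # 17 43 18
```

The mean of the 75 per-class values is 76.3%, which agrees with the overall accuracy in Table 2. It also shows how uneven the result is: 43 types are at 90% or above, while 18 types fall below 50%.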

Table 2. Comparison with other methods on FFT-75 dataset (scenario #1, 512 bytes).

Method	Feature	Accuracy (%)
FiFTy [15]	Raw data	65.6
DS-CNN [24]	Raw data	65.9
Byte2Image [27]	Intra-byte n-grams	71.0
Ours	Raw data	76.3
