Data Deduplication System Based on Content-Defined Chunking Using Bytes Pair Frequency Occurrence

Abstract: Every second, enormous amounts of data are generated by emerging technologies, and storing and handling such volumes of data is very challenging. Data deduplication is a solution to this problem: it eliminates duplicate data and stores only a single copy, reducing storage utilization and the cost of maintaining redundant data. Content-defined chunking (CDC) plays an important role in data deduplication systems due to its ability to detect high redundancy. In this paper, we focus on deduplication system optimization by tuning the relevant factors in CDC to identify chunk cut-points and by introducing an efficient fingerprint using a new hash function. We propose a novel bytes frequency-based chunking (BFBC) algorithm and a new low-cost hashing function. To evaluate the efficiency of the proposed system, extensive experiments were conducted using two different datasets. In all experiments, the proposed system consistently outperformed the common CDC algorithms, achieving a better storage gain ratio and enhancing both chunking and hashing throughput. Practically, our experiments show that BFBC is 10 times faster than basic sliding window (BSW) and approximately three times faster than two thresholds two divisors (TTTD). The proposed triple hash function algorithm is five times faster than SHA1 and MD5 and achieves a better deduplication elimination ratio (DER) than other CDC algorithms. The symmetry of our work lies in the balance between the proposed system's performance parameters and their reflection on the system's efficiency compared to other deduplication systems.


Introduction
The amount of digital data is rising explosively, and the forecasted amount of data to be generated by the end of 2020 is about 44 zettabytes. Because of this "data flood," storing and maintaining backups for such data efficiently and cost-effectively has become one of the most challenging and essential tasks in the big data domain [1][2][3]. Enterprises, IT companies, and industries need to store and operate on enormous amounts of data, and the big issue is how to manage these data. To manage data properly, data deduplication techniques are used. Dropbox, Wuala, Mozy, and Google Drive employ deduplication techniques to reduce cloud storage capacity and utilize cloud storage more appropriately. Data deduplication is becoming a dominant technology for reducing the space requirement of both primary file systems and data backups [4,5]. It is an effective data reduction technique. The main advantages of the proposed system are as follows:
1. The proposed system defines chunk boundaries based on the bytes frequency of occurrence instead of the byte offset (as in the fixed-size chunking technique), so any change in one chunk will not affect the next one, and the effect will be limited to the changed chunk only.

2. Content-defined chunking methods consume substantial processing resources to calculate hash values using SHA1 or MD5 for data fingerprinting, while the proposed system uses mathematical functions to generate three hashes that consume fewer computing resources. Furthermore, compared to traditional methods such as TTTD that rely on SHA1 or MD5 fingerprints, the number of bits needed to store these three hashes is only 48, which is less than the 160 bits needed to store a SHA1 hash or the 128 bits needed for MD5.
The details of the proposed system will be presented in the remainder of the paper. Section 2 illustrates works related to data deduplication. In Section 3, system methodology is discussed. In Section 4, the proposed system is described. In Section 5, the results of our suggested method are discussed. The last section provides conclusions and addresses future works.

Related Work
Data deduplication systems have been the subject of intensive research over the last few years. They generally detect redundant objects by comparing calculated fingerprints rather than comparing data byte by byte, and they transparently eliminate the redundant objects [14]. At present, researchers mainly focus on improving data chunking algorithms to increase the deduplication elimination ratio (DER). Bin Zhou et al. [15] explained that most current chunking algorithms use the Rabin sliding-window algorithm for the chunking stage, which utilizes a large amount of CPU resources and leads to performance issues. Accordingly, they proposed a new chunking algorithm, namely, the bit string content aware chunking strategy (BCCS), which calculates the chunk's fingerprint using a simple shift operation to reduce resource utilization. Venish and Sankar [10] discussed different chunking methods and algorithms and assessed their performance. They found that an effective and efficient chunking algorithm is crucial: if the data are chunked precisely, both the throughput and the deduplication performance increase. Compared to file-level chunking and fixed-size chunking, content-based chunking approaches deliver good throughput and reduce space consumption.
Wang et al. [16] presented a logistic-based mathematical model to enhance the DER based on the expected chunk size, as the previously proposed 4 KB or 8 KB chunk sizes did not provide the best optimization of the DER. To validate the correctness of the model, they used two realistic datasets, and according to the results, the R2 value was above 0.9. Kaur et al. [17] presented a comprehensive literature review of existing data deduplication techniques and of the various classifications of deduplication techniques for cloud data storage. They also explored deduplication techniques based on text and multimedia data along with their corresponding classifications, as these data types pose many challenges for duplicate detection, and they discussed existing challenges and significant future research directions in deduplication. Zhang et al. [18] proposed a new CDC algorithm called the asymmetric extremum (AE), which has higher chunking throughput and smaller chunk size variance than existing CDC algorithms and an improved ability to find chunk boundaries in low-entropy strings. Their system shows enhanced performance and reduces the chunking throughput bottleneck while maintaining deduplication efficiency. Nie et al. [19] developed a method to optimize deduplication performance by analyzing the chunk block size to prevent blocks that are too large or too small, which affects data deduplication efficiency; they showed that selecting the optimum block size for the chunking stage improves the data deduplication ratio. Fan Ni [20] quantified the impact of existing parallel CDC methods on the deduplication ratio and proposed a two-phase CDC method (SS-CDC) that provides the substantially increased chunking speed of parallel CDC approaches while achieving the same deduplication ratio as sequential CDC. That work also identified an opportunity in journaling file systems, where fast non-collision-resistant hash functions can be used to generate weak fingerprints for detecting duplicates, thus avoiding the write-twice issue of the data journaling mode without compromising data correctness and reliability. Xia et al. [21] proposed a fast CDC approach for data deduplication built on five key techniques, namely, a gear-based fast rolling hash, optimized gear hash judgment for chunking, sub-minimum chunk cut-point skipping, normalized chunking, and two-byte rolling. Their experimental results showed that the approach achieves a chunking speed about 3-12× higher than state-of-the-art CDC while accomplishing nearly the same or a higher deduplication ratio than Rabin-based CDC. Taghizadeh et al. [22] developed an intelligent approach for data deduplication on flash memories, in which write requests are classified based on data content and type and their metadata are stored as separate categories to enhance the search operation, improving the search delay and considerably enhancing the deduplication rate.
The core enabling technologies of data deduplication are chunking and hashing. The performance bottleneck caused by the computation-intensive chunking and hashing stages presents a significant challenge, and most of the above research worked on optimizing these two stages, as minimizing the chunking and hashing overhead is becoming an increasingly urgent aim for deduplication. In this paper, we introduce a method for optimizing the performance of the deduplication system by improving the relevant key factors in CDC to find the optimal chunk cut-points and by endorsing a new hash function to generate chunk fingerprints, which significantly reduces the system's computational overhead.

Methodology
Data deduplication is needed to eliminate redundant data and reduce the required storage. It has been gaining increasing attention and has been evolving to meet the demand for storage cost savings, enhanced backup speed, and a reduced amount of data transmitted across a network. In this paper, we developed a CDC algorithm based on the frequency of byte-pair occurrences and a new hashing algorithm based on a mathematical triple hashing function.

Content-Defined Chunking
Content-defined chunking (CDC) is a data deduplication chunking technique that sets the chunking breakpoint based on the content of the data, when a predefined breaking condition becomes true. It eliminates the boundary-shifting problem from which fixed-size chunking (static chunking) suffers: any modification to the stream of data, even adding or removing a single byte, leads to the generation of a different set of chunks with different hash values, so the shifted data is treated as new data and deduplication efficiency is affected.
The most common content-defined chunking approaches used in deduplication are as follows.

Basic Sliding Window (BSW) (Usually Known as Rabin CDC)
This is one of the legacy CDC chunking algorithms; it uses a fingerprint hashing function (the Rabin fingerprint) as the breaking condition after setting three parameters that will be used for chunking (the sliding window size, the divisor D, and the remainder R, where D > R).
The main drawback of this algorithm is that, for each shift of the window, a fingerprint is generated and a condition is checked (fingerprint mod D = R?) to set the chunk boundary. Such a calculation consumes an enormous amount of computational resources (rolling hash computation overhead), which affects deduplication efficiency and throughput. Moreover, D is a predefined average chunk size and is not derived from the dataset itself, which leads to high chunk size variance [23]. Figure 1 illustrates the concept of the BSW algorithm.
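To make the breaking condition concrete, the following C# sketch (illustrative only, not the authors' implementation) slides a window over the data and tests fingerprint mod D = R at every offset; a simple polynomial rolling hash stands in for the true Rabin fingerprint, and the class and parameter names are placeholders.

using System.Collections.Generic;

static class BswSketch
{
    // Returns candidate cut-point offsets where the rolling fingerprint satisfies fp mod D == R.
    public static List<int> CutPoints(byte[] data, int window, ulong d, ulong r)
    {
        var cuts = new List<int>();
        const ulong basePrime = 257;
        ulong topPower = 1;
        for (int k = 0; k < window - 1; k++) topPower *= basePrime;   // basePrime^(window-1), wrapping mod 2^64

        ulong fp = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (i >= window) fp -= topPower * data[i - window];        // drop the byte leaving the window
            fp = fp * basePrime + data[i];                             // bring in the new byte
            if (i >= window - 1 && fp % d == r)                        // breaking condition checked at every shift
                cuts.Add(i + 1);                                       // chunk boundary after position i
        }
        return cuts;
    }
}

Because the condition is evaluated at every byte offset, the rolling-hash cost dominates the chunking stage; this is precisely the overhead that BFBC avoids.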

Two Threshold Two Divisors Chunking (TTTD)
This algorithm uses the same concept as BSW but introduces four new parameters: the minimum size threshold (Tmin), used to eliminate very small chunks; the maximum size threshold (Tmax), used to eliminate very large chunks; the main divisor (D); and the secondary divisor (D'), which is half the value of the main divisor and is used to find the breakpoint if the algorithm fails to find it with the main divisor.
The drawback of this algorithm is that the main divisor value is an estimate and is not related to the content of the data. In addition, the algorithm does not use the secondary divisor until the Tmax threshold is reached, and in most cases the breakpoint found by the secondary divisor is very close to Tmax. Such unnecessary calculations and comparisons are computationally expensive and affect the performance of the algorithm [23].
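The dual-divisor fallback can be sketched in C# as follows. This is an illustrative reading of TTTD, not a reference implementation: the fingerprint function is passed in as a placeholder delegate, the usual convention of matching against D − 1 and D' − 1 is assumed, and re-scanning after a backup cut is omitted for brevity.

using System;
using System.Collections.Generic;

static class TttdSketch
{
    public static List<int> CutPoints(byte[] data, int tMin, int tMax, ulong d, ulong dPrime,
                                      Func<byte[], int, ulong> fingerprint)
    {
        var cuts = new List<int>();
        int last = 0, backup = -1;
        for (int i = 0; i < data.Length; i++)
        {
            int size = i - last + 1;
            if (size < tMin) continue;                         // Tmin: ignore breakpoints in too-small chunks
            ulong fp = fingerprint(data, i);                   // rolling fingerprint at position i
            if (fp % dPrime == dPrime - 1) backup = i;         // remember the latest secondary-divisor hit
            if (fp % d == d - 1) { cuts.Add(i); last = i + 1; backup = -1; continue; }
            if (size >= tMax)                                  // main divisor failed before reaching Tmax
            {
                int cut = backup >= 0 ? backup : i;            // fall back to D', otherwise force a cut at Tmax
                cuts.Add(cut); last = cut + 1; backup = -1;    // (simplification: bytes after the backup cut are not re-scanned)
            }
        }
        return cuts;
    }
}

As the paper notes, the backup breakpoint is typically found only shortly before Tmax, so the extra D' checks add little benefit while still costing a fingerprint evaluation per byte.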

The Proposed Bytes Frequency Based Chunking (BFBC)
BFBC is a new CDC algorithm that depends on content analysis: a statistical model generates a histogram of the frequencies of occurrence of byte pairs (pair-byte distribution analysis) in the dataset, from which a set of divisors is built and used as chunking breakpoints, as shown in Algorithm 1, which describes the chunking algorithm.

Algorithm 1 Chunking algorithm
Objective: Divide the file into chunks based on Tmin, the list of divisors, and Tmax.
Input: A file of any type or size; Tmin; Tmax; list of divisors.
Output: A number of variable-sized chunks.
Step 1: Set Breakpoint ← 0, Pbreakpoint ← 0.
Step 2: Read the file as an array of bytes. Set Length ← file size, Full Length ← file size.
Step 3: If Length equals zero, go to Step 2.
Step 4: If Length <= Tmin, consider it as a chunk and send it to the triple hash function (Algorithm 2). Length = Full Length − chunk length.
Step 5: If Length is between Tmin and Tmax, search for the divisor byte pair from Breakpoint + Tmin until Full Length. If found, consider the chunk from Pbreakpoint until the divisor byte position and send it to the triple hash function (Algorithm 2). Breakpoint = chunk length + 1; Length = Full Length − Breakpoint.
Step 6: If Length equals Tmax, consider the chunk from Pbreakpoint until Full Length and send it to the triple hash function (Algorithm 2). Breakpoint = chunk length + 1; Length = Full Length − Breakpoint.
Step 7: If Length > Tmax, search for the divisor byte pair from Breakpoint + Tmin until Pbreakpoint + Tmax. If found, consider the chunk from Pbreakpoint until the Breakpoint and send it to the triple hash function (Algorithm 2). Pbreakpoint = Breakpoint; Length = Full Length − Breakpoint.
Step 8: Go to Step 3.

Both BSW and TTTD suffer from the degradation of the chunking throughput due to Rabin's rolling hash function performance bottleneck. BFBC significantly improves chunking throughput and provides better deduplication efficiency by using a list of divisors generated from the statistical characteristics of the dataset itself, without the need to calculate the Rabin fingerprint for each cut-point judgment.
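The cut-point search in Algorithm 1 can be summarized by the following C# sketch. This is a simplified reading of the algorithm, not the authors' implementation: it assumes the divisors are two-byte values held in a hash set, and it treats the question of whether the matched pair belongs to the current chunk as an implementation choice (here it is included).

using System;
using System.Collections.Generic;

static class BfbcChunker
{
    // Returns the exclusive end offsets of the chunks.
    public static List<int> Chunk(byte[] data, HashSet<ushort> divisors, int tMin, int tMax)
    {
        var cuts = new List<int>();
        int start = 0;
        while (start < data.Length)
        {
            int remaining = data.Length - start;
            if (remaining <= tMin) { cuts.Add(data.Length); break; }   // tail chunk smaller than Tmin

            int limit = Math.Min(start + tMax, data.Length);
            int cut = limit;                                           // default: forced cut at Tmax (or end of file)
            for (int i = start + tMin; i + 1 < limit; i++)
            {
                ushort pair = (ushort)((data[i] << 8) | data[i + 1]);  // current byte pair
                if (divisors.Contains(pair)) { cut = i + 2; break; }   // divisor match: place the breakpoint here
            }
            cuts.Add(cut);
            start = cut;
        }
        return cuts;
    }
}

Note that, unlike BSW and TTTD, no per-byte fingerprint is computed; each candidate position requires only a two-byte lookup in the divisor set.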

Triple Hashing Algorithm
Using a linear bounded sum of the chunk's byte string multiplied by a random sequence of numbers can produce hash values to be used as the chunk's fingerprint, as shown in Algorithm 2; using three hashes to represent the chunk content produces a combined fingerprint with a lower probability of collision. The linear sum of multiplications followed by a bounding operation has a low computational cost compared with the computational complexity of cryptographic hash functions (e.g., SHA1 and MD5).
The length of the proposed hash values that represent chunk fingerprints is very short (16 bits for each hash function, for a total of 48 bits for the three hashes), which is very small compared to the length of SHA1 (160 bits) or MD5 (128 bits); this reduces the overhead information required to represent the hash table. Algorithm 2 describes the proposed triple hashing function.
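The listing of Algorithm 2 is not reproduced in this excerpt, so the following C# sketch shows only one possible construction that fits the description above: a linear sum of byte-times-coefficient products bounded to 16 bits, repeated with three independent coefficient sequences. The seed, coefficient ranges, and indexing scheme are assumptions made for illustration, not the authors' exact function.

using System;

static class TripleHash
{
    static readonly int[][] Coeff = BuildCoefficients();   // three fixed pseudo-random coefficient sequences

    static int[][] BuildCoefficients()
    {
        var rng = new Random(12345);                        // fixed seed so fingerprints are reproducible
        var c = new int[3][];
        for (int k = 0; k < 3; k++)
        {
            c[k] = new int[256];
            for (int j = 0; j < 256; j++) c[k][j] = rng.Next(1, 1 << 16);
        }
        return c;
    }

    // Returns three 16-bit hash values that together form the 48-bit fingerprint.
    public static ushort[] Fingerprint(byte[] chunk)
    {
        var h = new uint[3];
        for (int i = 0; i < chunk.Length; i++)
            for (int k = 0; k < 3; k++)
                h[k] += (uint)chunk[i] * (uint)Coeff[k][i & 0xFF];   // weighted linear sum of the chunk bytes
        return new[] { (ushort)(h[0] & 0xFFFF), (ushort)(h[1] & 0xFFFF), (ushort)(h[2] & 0xFFFF) };   // bound to 16 bits
    }
}

Because each hash is only a masked accumulation of multiply-add operations, computing all three is far cheaper than a cryptographic digest, at the price of weaker collision resistance; the cascade comparison described later compensates by also matching on chunk size and divisor type.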
A full comparison between the proposed hashing function and the commonly used hashing functions (SHA1 and MD5) by other deduplication systems will be described and evaluated in the experimental section using hashing throughput and storage saving benchmarks.

Experimental Datasets
Two datasets with different characteristics were used to test system performance and efficiency. The first dataset consists of different versions of Linux source codes [24], from the Linux Kernel Archives, while the second dataset consists of 309 versions of SQLite [25]. Table 1 shows the characteristics of the used datasets.

The Proposed System
The proposed system focuses on a new CDC technique used to produce chunks of variable size and to determine chunk boundaries in the content by threshold breakpoints. Aware of the problems in existing CDC algorithms, in this paper we propose a novel bytes frequency-based chunking (BFBC) technique. The BFBC technique reduces chunk size variation and maintains the DER and deduplication performance. It employs a sampling technique to determine the chunk boundaries based on the data content: the algorithm moves through the data stream byte by byte, and if a data block satisfies certain predefined conditions (a divisor match or Tmax), its end is marked as a chunk cut-point; the data between the current cut-point and the previous cut-point forms the resulting chunk. The BFBC technique consists of the following stages and is illustrated in Figure 2:

Load the Dataset
In this stage, the system starts reading files from the dataset one by one and processes each file as an array of bytes, preparing it to be scanned in the next stage.

Divisors Analysis and Selection
Byte distribution analysis is a statistical analysis whereby a binary file is examined in terms of its byte constituents, since each dataset typically contains some bytes that occur constantly. This part of the proposed system computes the frequency of byte pairs in the input file, i.e., how many times each pair of bytes appears, in order to return the most frequently occurring pairs in the dataset. The function traverses the given dataset byte by byte, stores the number of times each pair occurs, and then sorts the output in descending order so that the top byte pairs can be selected for the next stage of the chunking process. Table 2 shows the list of the top 10 pairs (divisors) generated by the divisors analysis and selection stage for Datasets 1 and 2, sorted in descending order. The number of divisors that yields the best DER will be discussed in the next section.
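As a concrete illustration (not the authors' code), the counting and selection described above can be expressed in a few lines of C#; the flat 65,536-entry histogram and the use of overlapping pairs are assumptions of this sketch.

using System.Collections.Generic;
using System.Linq;

static class DivisorSelector
{
    public static HashSet<ushort> TopPairs(IEnumerable<byte[]> files, int n)
    {
        var freq = new long[65536];                       // one counter per possible byte pair
        foreach (var data in files)
            for (int i = 0; i + 1 < data.Length; i++)
                freq[(data[i] << 8) | data[i + 1]]++;     // histogram of overlapping byte pairs

        return Enumerable.Range(0, 65536)
                         .OrderByDescending(p => freq[p]) // sort pairs by frequency, descending
                         .Take(n)                         // keep the N most frequent pairs as divisors
                         .Select(p => (ushort)p)
                         .ToHashSet();
    }
}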

BFBC Chunking
The main task of this phase is to partition the input file (a stream of bytes) into small, non-overlapping chunks using the new BFBC technique. The chunk boundary, or breakpoint, is determined by a certain group of divisors that depends on the contents of the dataset; that is, the technique relies on the characteristics of the original file as a data container and defines boundaries based on the occurrence of byte pairs in the dataset. The breaking condition in BFBC is either finding one of the listed divisors D (generated in the previous stage) after Tmin or reaching the Tmax threshold value if the divisor condition fails to determine a breakpoint. BFBC therefore guarantees a minimum and maximum chunk size; the minimum and maximum size limits are used to split a file into chunks when D is not found. Three main parameters need to be pre-configured: the minimum threshold (Tmin), the list of divisors (D), and the maximum threshold (Tmax), as described in Table 3.

Table 3. The purpose of the chunking parameters.

Parameter                    Purpose
Tmin (minimum threshold)     To reduce the number of very small chunks
Divisors (D)                 To determine breakpoints
Tmax (maximum threshold)     To reduce the number of very large chunks

When the chunking method depends on the format of the file, the deduplication method can provide the best redundancy detection ratio compared with the fixed-size and other variable-size chunking methods.
Chunking steps are as follows:
a. Scan the sequence of bytes starting from the last chunk boundary and apply a minimum threshold to the chunk sizes (Tmin).
b. At each position after Tmin, check the current pair of bytes against the list of divisors to look for matches.
c. If a D-match is found before reaching the threshold Tmax, use that position as a breakpoint.
d. If the search for a D-match fails and the Tmax threshold is reached, use the current position (threshold Tmax) as a breakpoint.
Tail chunks result for the following reasons:
• The size of the file is smaller than Tmin;
• The fragment from the last breakpoint to the end of the file is smaller than Tmin;
• The algorithm cannot find any breakpoint from the last breakpoint to the end of the file, even if the size of the chunk is larger than Tmin.
Figure 3 shows the chunk types generated by the chunking stage. Figure 4 illustrates the chunking technique used in the proposed system.
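Putting the two earlier sketches together, a minimal (hypothetical) driver for the divisor-selection and chunking stages might look as follows; the file name is a placeholder, and the parameter values (Tmin = 128 bytes, Tmax = 512 bytes, four divisors) are used because these settings performed best in the experiments reported later.

using System.Collections.Generic;
using System.IO;

static class BfbcDriver
{
    static void Run()
    {
        var files = new List<byte[]> { File.ReadAllBytes("linux-src-sample.tar") };   // placeholder input file
        var divisors = DivisorSelector.TopPairs(files, 4);                // divisors analysis and selection stage
        var cutPoints = BfbcChunker.Chunk(files[0], divisors, 128, 512);  // BFBC chunking stage
    }
}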

Triple Hashing, Indexing, and Matching Stage
Legacy deduplication systems suffer from high computational power and disk space requirements and waste time resolving hash collisions. In this work, a new hashing technique is proposed to save resources and reduce processing time.
The hashing part of the system uses a new, simple hashing function to compute three hash values for each chunk. Each chunk produced by the chunking stage is sent to the proposed hash function, which generates three hash values describing the chunk's contents. Each hash is generated by a mathematical function and is 16 bits in size, so the total number of bits needed to store the three hashes is 48. The traditional hashing functions (SHA1 and MD5) used by other content-defined chunking methods consume a substantial amount of processing resources to calculate hash values and require much more storage space to store them: the number of bits needed to store our proposed hashes is less than the number of bits needed to store a SHA1 (160 bits) or MD5 (128 bits) hash value.
For each chunk, a chunk ID is created, which includes the chunk size, the divisor type, and the three generated hashes. To find duplicated chunks, if the chunk size, divisor type, and first hash of the two compared chunks match, then the second hashes of the two chunks are compared, followed by the third. This cascade comparison reduces the time needed to compare chunks by eliminating byte-by-byte comparison. If the chunks are found to be identical, the chunk's pointer in the metadata table is updated to point to the already existing chunk by incrementing that chunk's reference count, and the new (duplicate) chunk is released. Otherwise, the new (non-duplicated) chunk is saved in the unique data container, its chunk reference is saved in the metadata table, and the three hashes are saved in a temporary hash index table for further processing. Figure 5 shows the steps of this stage.
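A minimal C# sketch of the chunk-ID matching just described is shown below; the record fields mirror the ID components named in the text (chunk size, divisor type, and the three 16-bit hashes), while the container and method names are assumptions. Value equality of the record effectively performs the cascade comparison, since a candidate is a duplicate only when every field matches.

using System.Collections.Generic;

record ChunkId(int Size, ushort DivisorType, ushort H1, ushort H2, ushort H3);

static class DuplicateIndex
{
    static readonly Dictionary<ChunkId, int> RefCount = new();   // chunk ID -> reference count

    // Returns true if the chunk was already stored (duplicate), false if it is new.
    public static bool AddOrCount(ChunkId id)
    {
        if (RefCount.TryGetValue(id, out var n))      // size, divisor type, and all three hashes match
        {
            RefCount[id] = n + 1;                     // duplicate: bump the reference count and discard the chunk
            return true;
        }
        RefCount[id] = 1;                             // unique: the chunk data and metadata are stored elsewhere
        return false;
    }
}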
After deduplication, three types of data are stored:
• Unique data. Non-duplicated chunks produced by the algorithm.
• Metadata. To rebuild the dataset, address information related to the chunks needs to be stored. Regardless of whether a chunk contains unique data or not, the related metadata needs to be available for the future retrieval stage; accordingly, the total number of metadata entries equals the total number of chunks.
• Hash index. The hash values of each non-duplicated chunk are stored as the chunk's fingerprint for use in detecting duplicated chunks. Each record within the hash index table is considered a chunk ID. Table 4 shows an example of a hash index table.

Experimental Setup
We built the deduplication system from scratch including chunking, hashing, and matching stages using C# language. The configuration of the computer used for the experiment is described as follows: CPU: Intel Core i7-3820QM @ 2.70 GHz 4-core processor; RAM: 16 GB DDR3; Disk: 1TB PCIe SSD; Operating system: Windows 10 64-bit.

Experimental Results
In this section, we analyze the main properties of the algorithm, namely, the chunk size, the divisor selection, and the effect of the proposed hashing function. We evaluate the storage space saving and describe the relationship between the storage saving and the two chunking parameters (the number of divisors and the chunk size), aiming to identify the settings that maximize the deduplication ratio. The results show that the required storage capacity can be substantially reduced when data deduplication techniques are applied.

Choosing the Optimum Chunk Size (Optimizing Storage Saving by Setting the Expected Chunk Size)
Chunk size has a direct impact on the deduplication ratio and needs to be optimized to achieve the best results. Small Tmin/Tmax values detect more duplicated chunks but increase the metadata size, while large Tmin/Tmax values decrease the deduplication ratio because they produce large chunks, which in turn reduces redundant data detection. Table 5 shows the dataset size after deduplication for different values of Tmin/Tmax. The experimental results are presented in Figure 6, where the x-axis represents the chunk size and the y-axis represents the dataset size after deduplication. As the chunk size (Tmin/Tmax) increases, the deduplication ratio gradually degrades. The results show that the storage size after duplicate elimination is optimal when the chunk size is set between 128 and 512 bytes for both datasets: deduplication at that chunk size reduced 5.93 GB of data to 1.19 GB in Dataset 1 and 6.44 GB to 214.9 MB in Dataset 2. Given the significant advantage of small chunk sizes, the results illustrate why Tmin/Tmax plays an important role in deduplication.

Chunk Distribution Based on the Number of Divisors
The efficiency of the proposed system depends on the ratio of chunks that are generated using the list of proposed divisors. Figure 7 and Table 6 show the chunk distribution based on the number of divisors in Dataset 1. Table 7 shows that most chunks were generated using the divisors, while the ratio of chunks generated by reaching Tmax is minimal. According to the results of our experiments, more than 98% of the total chunks were determined by a divisor when the chunk size was 128-512 bytes, which clearly shows the efficiency of the proposed chunking algorithm.

The Impact of the Number of Divisors Selected on the Data Duplication Ratio
In this subsection, we present results from the evaluation of our deduplication technique based on experiments and analysis. The space savings achievable with deduplication indicate its usefulness with respect to our target workload. Figure 8 shows the deduplication behavior when different numbers of divisors are selected; the x-axis represents the number of divisors selected, while the y-axis represents the storage size after deduplication. The proposed system was tested with a range of divisor counts to determine the number of divisors that yields the highest deduplication ratio.
After examining the byte-pair occurrences (divisors), we found that a 128-512 byte chunk size yields the best deduplication ratio when the number of pairs is between 3 and 10. According to Figure 8, when the number of divisors is 4, we obtained the smallest total size after deduplication (highest storage gain) for both datasets. Figure 9 shows that the storage gain varies with the number of divisors; the storage gain after deduplication (deduplication space savings) reached 76.98% for Dataset 1 and 96.76% for Dataset 2 when the number of divisors was four. Table 8 shows a sample of the number of duplicate chunks in Dataset 1 for each chunk size, with specific hash values, in terms of the number of references to each block in the file system after deduplication. At the very peak, some chunks were duplicated more than 4000 times in Dataset 1; each of these chunks individually represents an enormous amount of space that would otherwise be wasted storing duplicate data. Overall, these data serve to show the possibility of space savings from deduplication. The chunking stage of the proposed system relies on an analysis of the dataset content to discover maximum redundancy and finds the duplicated data faster than other approaches: according to the results, BFBC is about 10 times faster than BSW and three times faster than TTTD, which leads to a significant increase in chunking throughput, as shown in Table 9 and Figures 10 and 11.
\text{Chunking Algorithm Throughput} = \frac{\text{Processed Data (MB)}}{\text{Chunking Time (s)}} \qquad (1)

The Impact of the Proposed Hashing Function
A. Impact on Storage Utilization: The proposed mathematical triple hash function used to generate the chunk's fingerprint has a direct impact on the space needed to store the hashes in the index table. Each fingerprint requires 48 bits (6 bytes), while the traditional hash functions MD5 and SHA1 require 128 bits (16 bytes) and 160 bits (20 bytes), respectively. Table 10 and Figure 12 show the impact of the hash algorithms on the storage size required for storing fingerprints in the index table, computed by Equation (2).
B. Impact on Computational Overhead: The proposed triple hash algorithm uses a simple mathematical equation, whereas the traditional hashing functions (SHA1 and MD5) used by other content-defined chunking methods consume substantial processing resources and cause heavy CPU overhead when calculating hash values. Table 11 and Figures 13 and 14 show the impact of the hash algorithms on the hashing stage time and throughput, computed by Equation (3). According to these results, for both datasets, the proposed algorithm requires the least hashing time, which leads to better throughput compared with the traditional hashing algorithms.
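Equation (2), referenced above, is not reproduced in this excerpt. Under the natural assumption that the index table stores one fingerprint per unique chunk, the quantity it computes is approximately

\text{Index table size} \approx N_{\text{unique chunks}} \times \text{fingerprint length},

so each entry of the proposed 48-bit (6-byte) fingerprint occupies only 30% of the space of a SHA1 entry (20 bytes) and 37.5% of an MD5 entry (16 bytes). The hashing throughput referenced as Equation (3) is defined analogously to Equation (1):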
\text{Hashing Algorithm Throughput} = \frac{\text{Processed Data (MB)}}{\text{Hashing Time (s)}} \qquad (3)

Data Size after Deduplication and the Deduplication Elimination Ratio (DER)
The performance of BSW, TTTD, and our proposed solution was compared in terms of the size after deduplication and the DER (computed by Equation (4)). The results presented in Table 12 and Figures 15 and 16 clearly show that the proposed chunking algorithm provides greater storage saving and an improved DER compared with the other deduplication methods.

\text{Deduplication Elimination Ratio (DER)} = \frac{\text{Input Data Size before Deduplication (MB)}}{\text{Output Data Size after Deduplication (MB)}} \qquad (4)

The objective of our experiments was to compare the performance of the proposed BFBC chunking algorithm with that of the BSW and TTTD chunking algorithms, and to compare the proposed triple hashing algorithm with SHA1 and MD5. BFBC was shown to effectively improve the deduplication throughput performance and, with the help of the triple hashing function, to reduce computation time dramatically, as shown in the experimental results.
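For intuition, applying Equation (4) to the chunk-size experiment reported earlier gives roughly the following DER values (derived here from the reported sizes rather than quoted from the paper):

\text{DER}_{\text{Dataset 1}} \approx \frac{5.93\ \text{GB}}{1.19\ \text{GB}} \approx 5.0, \qquad \text{DER}_{\text{Dataset 2}} \approx \frac{6.44\ \text{GB}}{214.9\ \text{MB}} \approx 30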

Conclusions and Future Work
In this paper, a combination of two new chunking and hashing approaches provides impressive storage efficiency by reducing space usage and optimizing CPU throughput. The first approach is bytes frequency-based chunking (BFBC), which uses byte-pair frequency information from the dataset to improve the data deduplication gain; it is based on the statistical byte frequency, which indicates the highly frequent pairs of bytes within the dataset. We demonstrated that the list of divisors is the core component of this approach: designing the break condition for cut-point identification around a list of predefined divisors enhances the performance of the chunking algorithm. The second approach is the proposed triple hash algorithm, which uses a mathematical function to generate short fingerprints and thus has a direct impact on the index table size and the hashing throughput.
The experimental results show that chunking with a list of divisors generated from the content of the dataset, with the expected chunk size thresholds (Tmin-Tmax) set to 128-512 bytes, allows the BFBC chunking algorithm to effectively improve the deduplication throughput performance. BFBC is 10 times faster than BSW and approximately three times faster than TTTD, and the proposed triple hash function is five times faster than SHA1 and MD5.
However, there are possible limitations. First, system efficiency may be affected if the dataset content has a low ratio of similarity (e.g., contains a high number of compressed images or audio files); in this case, the system will face performance degradation due to the enormous variance in the content of the dataset. Another limitation is that running the system on a very large dataset will require larger hashes to represent the fingerprint in order to reduce the possibility of hash collisions; this will increase the hash index table size and the computational overhead.
In the future, we will study the option of building an automated method that generates the list of divisors based on the percentage of cumulative byte-pair occurrence (e.g., automatically generating a list of divisors based on a 20% cumulative occurrence of pair bytes), and we will analyze system behavior when triplets of bytes (or larger byte groups) are used instead of pairs of bytes in the divisors analysis and selection stage.
Author Contributions: The model was proposed by A.S.M.S., who collected experimental data and performed the analysis; L.E.G. provided guidance for this paper and was the research advisor. All authors have read and agreed to the published version of the manuscript.