Practical Challenge of Shredded Documents: Clustering of Chinese Homologous Pieces

When recovering a shredded document that has numerous mixed pieces, the difficulty of the recovery process can be reduced by clustering, which is a method of grouping pieces that originally belonged to the same page. Restoring homologous shredded documents (pieces from different pages of the same file) is a frequent problem, and because these pieces have nearly indistinguishable visual characteristics, grouping them is extremely difficult. Because homologous pieces are ubiquitous, research on clustering them has important practical significance for document recovery. Because of the wide usage of Chinese and the huge demand for Chinese shredded document recovery, our research focuses on Chinese homologous pieces. In this paper, we propose a method of completely clustering Chinese homologous pieces in which the distribution features of the characters in the pieces and the document layout are used to correlate adjacent pieces and cluster them in different areas of a document. The experimental results show that the proposed method clusters real pieces well: on a dataset of 10 document pages (462 pieces in total), the average accuracy is 97.19%.


Introduction
Shredded document recovery is a complicated and challenging problem that has been studied by many researchers. Paper documents are separated into large numbers of pieces when they are shredded. These pieces are highly similar and present chaotic sequences, thereby increasing the difficulty of document recovery. Shredded document recovery has important research value, and the relevant findings can be extensively applied in several fields, such as information security [1], judicial investigations [2], and archaeological research [3].
Shredded document recovery is a complicated non-deterministic polynomial-hard (NP-hard) problem [4]. The recovery task includes several steps, and piece clustering is one of the key steps [5]. As the number of pieces increases, the difficulty of document recovery also increases [6]. In clustering, a large number of pieces are grouped into several clusters, and the pieces in each cluster are processed together, thereby reducing the difficulty of piece searching and improving the accuracy of piece matching [7]. Because of the high similarity between shreds, piece clustering is difficult.
Research on piece clustering can be divided into two categories. The first piece clustering category is based on a single-page document. Wang et al. [8] considered piece clustering according to the distribution feature of a text line and assessed cluster validity by the matching proportion method. Sleit et al. [9] treated the clustering operation as a part of document reconstruction itself and used the cost function for piece matching and clustering. Richter et al. [10] utilized multimodal features, including shape, context, etc., to combine clusters and assemble shredded documents. Lei [11] used line information to cluster pieces from the same line. Similarly, Guo et al. [12] presented a row clustering method for shreds.
The second piece clustering category is based on multi-page documents. Ukovich et al. [5] employed a 12-dimension feature that included line spacing and paper/ink color. Based on these features, virtually shredded pieces from different files were clustered using hierarchical clustering. Schoier [13] used only the text line position as the feature to cluster pieces from multi-page documents, which have distinctly different page setups. Diem et al. [14] used several methods (color analysis, paper type analysis, and classification of the text) to cluster pieces from different sources. Chanda et al. [15] employed clustering as a preprocessing step for piece forensics and analyzed paper color and background texture to achieve piece clustering from different files. Liu et al. [16] proposed a spectral clustering algorithm that is based on the contour and color distribution of pieces, and several photos shredded by hand were clustered. Lalitha et al. [17] applied the shape information of pieces as the matching feature, clustered the different pages shredded by hand and reassembled the pieces.
Unlike the first category, which focuses on a small number of pieces and uses clustering to achieve piece matching and splicing, the second category must solve the problem of clustering numerous mixed pieces from different documents. Because of its high practical value, the second category is a hotspot in current research on piece clustering. Although some achievements have been made in the second category, it focuses on pieces that have distinct differences regarding page format (character size and line spacing), appearance (paper color and piece shape), or content (writing style). These visual differences are very helpful for clustering. However, a real file usually has a unified document format, in which all pages of the document must present consistent paper color, character size, and text line spacing to satisfy people's reading habits [18]. When the file is shredded, the produced pieces are highly similar regarding page format, appearance, and content. Because these pieces are derived from the same file, we refer to them as homologous pieces, as shown in Figure 1. Unlike the study objects in existing research, this paper addresses homologous pieces with a similar appearance. Distinguishing the pieces from different pages is difficult. The differences among homologous pieces are minimal, which hinders clustering using the features proposed in previous studies. Due to the ubiquitous nature of homologous pieces (they are extensively distributed in shredded documents), research on homologous piece clustering is significant to real shredded document recovery. Because China is an influential country and the use of Chinese is extensive, millions of Chinese documents are produced every year; thus, Chinese shredded document recovery is in huge demand. Therefore, this paper focuses on the problem of Chinese homologous piece clustering.
The contributions of this paper are as follows.

1. In contrast to existing studies, this paper addresses homologous pieces that have unified page format and high similarity with regard to content and appearance. As a result, clustering is very difficult. Since homologous pieces are prevalent in shredded documents and there is a high demand for recovery of these pieces, the study of this paper has important practical significance.
2. Because a document page includes only one leftmost piece and one rightmost piece, this paper can calculate the number of pages by recognizing the leftmost and rightmost pieces, thereby establishing a basis for obtaining the optimal clustering number.
3. By determining the correlations between characters and between characters and blank spaces in adjacent text lines, this paper matches the leftmost piece with the rightmost piece from the same page, thereby providing a good starting point for piece clustering.
4. Proceeding from the document, which is the source of the pieces, we propose a method of piece clustering that is based on the document layout. This method distinguishes pieces by the area in the document to which they belonged and uses the correlations between shreds to achieve effective clustering.

The remainder of this paper is organized as follows. In Section 2, the method of Chinese homologous piece clustering is presented, and the entire process of piece clustering is described in detail. In Section 3, the experimental results are discussed, and Section 4 presents the conclusions.

Clustering Method for Chinese Homologous Pieces
The clustering of Chinese homologous pieces involves grouping the pieces that originally belonged to the same page. This paper illustrates the clustering process from several aspects, including the clustering number, the starting point of the clusters, and the clustering calculation. Moreover, this paper addresses strip-cut shredded pieces [19], which are produced when a shredder cuts a paper document vertically rather than horizontally or obliquely. The documents processed in this paper are Chinese documents, specifically the printed documents common in offices rather than handwritten documents, which have different writing styles. They are also single-sided; double-sided documents are not within the scope of this paper.

Clustering Number
A validity problem for clustering is obtaining the optimal clustering number [6], which has a considerable influence on the clustering results. To determine the optimal clustering number of homologous pieces, we assume that all the pieces are present and then adopt the method proposed in [20] to encode the shreds. First, a piece is vertically divided into a series of blocks with the same size, as shown in Figure 2. Second, these blocks are transformed into the corresponding graphic types using a classifier based on five types of graphical Chinese characters, as shown in Figure 3. Finally, the piece is represented as a digital sequence. An analysis of the digit distribution in the pieces showed that in a page of a document, the type-1 and type-4 character graphs are most prevalent in the leftmost piece, while type-1 and type-5 character graphs are most prevalent in the rightmost piece, as shown in Figure 4. One page of a document has only one leftmost and one rightmost piece. If the leftmost and rightmost pieces can be identified in the shredded set, then the number of pages can be calculated based on the quantity of these pieces, and the clustering number can be obtained. The proportion $Q_{14}$ of type-1 and type-4 character graphs in each piece is calculated by Formula (1), and the proportion $Q_{15}$ of type-1 and type-5 character graphs in each piece is calculated by Formula (2).

$$Q_{14} = \frac{Num_1 + Num_4}{Num} \tag{1}$$

$$Q_{15} = \frac{Num_1 + Num_5}{Num} \tag{2}$$

where $Num_i$ indicates the number of the i-th type character graphs in a piece, with i = 1, 4, 5; and $Num$ represents the sum of all the types of character graphs in a piece. Because piece recognition is impacted by noise interference at the piece edges and is affected by classification errors, it is difficult to exactly distinguish the pieces using a single threshold for $Q_{14}$ or $Q_{15}$. Thus, in this paper, a dual threshold, $Q_{th1}$ and $Q_{th2}$ with $Q_{th1} > Q_{th2}$, is adopted to sort the values of $Q_{14}$ or $Q_{15}$ into different ranges, from which the types of pieces are determined.
The evaluation of the leftmost piece is used as an example. According to Formula (1), the $Q_{14}$ value of the test piece is calculated. When $1 \ge Q_{14} \ge Q_{th1}$, the test piece is considered a leftmost piece; when $Q_{th1} \ge Q_{14} \ge Q_{th2}$, the test piece may or may not be the leftmost piece and evaluating it requires manual assistance; and when $Q_{th2} \ge Q_{14} \ge 0$, the test piece is not a leftmost piece.
The method for evaluating the rightmost piece based on the $Q_{15}$ value is similar to the above method.
Using the above operation, the leftmost and rightmost pieces in the set of shreds are identified, and the number of leftmost pieces $N_L$ and the number of rightmost pieces $N_R$ can be obtained. Then, $N_L$ and $N_R$ are input into Formula (3), and the clustering number $N_C$ is calculated.
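The dual-threshold decision can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the threshold values 0.8 and 0.6 and the function names `proportion` and `classify_edge` are assumptions made for the example (the text only requires that the first threshold exceed the second), and a value exactly on a threshold is assigned to the higher class here.

```python
from collections import Counter
from typing import Sequence, Tuple

# Hypothetical thresholds; the text only requires Q_TH1 > Q_TH2.
Q_TH1 = 0.8
Q_TH2 = 0.6


def proportion(codes: Sequence[int], types: Tuple[int, int]) -> float:
    """Share of blocks whose graph type is one of `types`.

    types=(1, 4) gives Q14 (Formula (1)); types=(1, 5) gives Q15 (Formula (2)).
    """
    if not codes:
        return 0.0
    counts = Counter(codes)
    return (counts[types[0]] + counts[types[1]]) / len(codes)


def classify_edge(q: float, q_th1: float = Q_TH1, q_th2: float = Q_TH2) -> str:
    """Return 'yes', 'uncertain' (needs manual checking), or 'no'."""
    if q >= q_th1:
        return "yes"
    if q >= q_th2:
        return "uncertain"
    return "no"


# Block codes (graph types 1..5) of one made-up piece.
piece = [1, 4, 1, 1, 4, 2, 1, 4, 1, 3]
q14 = proportion(piece, (1, 4))
print(q14, classify_edge(q14))
```

Counting the pieces classified as leftmost and as rightmost in this way yields the two counts that feed Formula (3).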

Starting Point of Clusters
After performing the process described in Section 2.1, the leftmost and rightmost pieces of all documents can be obtained. However, a one-to-one relationship has not been established between the leftmost and rightmost pieces, which can cause serious problems when clustering; thus, the leftmost and rightmost pieces must be paired. The matched pieces will be the starting points of the clusters and provide the foundation for the following steps.
Although the content of a Chinese document is diverse, its layout is limited by the text format. The layout rules described in this paper are defined by the Layout Key for Official Document of Party and Government Organs (GB/T 9704-2012), promulgated by the General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China in 2012. They include rules such as "the first line of a paragraph should be indented; the end of a paragraph should wrap; sentences should be ended by a specific punctuation mark; certain punctuation should not be placed at the beginning of a text line". Because of these rules, in a page of a document, the leftmost character (including text and punctuation) or blank space on a text line is related to the rightmost character or blank space on the previous line of text. Therefore, although a considerable horizontal distance is observed between the leftmost piece and the rightmost piece from the same page of a document, these pieces can be related through the characters or blank spaces in adjacent horizontal text lines.
An example of the interrelationships among the leftmost and rightmost pieces from the same page is shown in Figure 5. The character in the first text line of the rightmost piece is a comma, indicating that the content of a sentence is paused rather than ended, and the character in the second text line of the leftmost piece is text, indicating that the content of the previous sentence continues. These two characters are closely related. Additionally, the block in the second text line of the rightmost piece is blank, indicating that the content of a paragraph has ended, and the block in the third text line of the leftmost piece is blank, indicating the indentation of the first text line at the beginning of a new paragraph. These two blanks are also closely related. The relationship between characters and blanks in adjacent horizontal text lines is similar.
The above analysis indicates that the block in a piece can be divided into text, punctuation (punctuation symbols described in this paper are defined by the General Rules for Punctuation (GB/T 15834-2011) promulgated by the General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China in 2011), and blank spaces according to its attributes. Because different types of punctuation lead to different degrees of relevance between two sentences [21], the block of punctuation must be further divided. Considerable differences are observed in the frequency of punctuation (in particular, the frequency of commas and periods in Chinese documents is much greater than that of other punctuation [22]); therefore, for punctuation in the leftmost piece and the rightmost piece, this paper only considers commas and periods while ignoring other punctuation. Blocks in the leftmost piece and the rightmost piece are divided into four types. A type I block is a blank, as shown in Figure 6a; a type II block only contains text, as shown in Figure 6b,c; a type III block contains a period, as shown in Figure 6d,e; and a type IV block contains a comma, as shown in Figure 6f,g.
Because of restrictions in the document layout, the distributions of these four types of blocks in the leftmost piece and the rightmost piece are different. All four block types occur in the rightmost piece, whereas only the type I and type II blocks occur in the leftmost piece. We describe the degree of correlation of a block in a text line of the rightmost piece and a block in the next text line of the leftmost piece via the probability shown in Table 1.
Table 1. Correlation of two blocks in the rightmost piece and the leftmost piece.
In Table 1, i represents the block type in a text line of the rightmost piece and j represents the block type in the next text line of the leftmost piece. Table 1 reflects the probability of occurrence of different types of leftmost blocks for different types of rightmost blocks.
The operation to match the rightmost piece with the leftmost piece is as follows: First, every piece is divided into a series of blocks in the vertical direction. Second, to classify the blocks, the text, periods, commas, and blanks in the blocks are distinguished by the method described in [23]. Third, a rightmost piece $R_i$ is selected arbitrarily, and the matching scores $SC(i)$ of $R_i$ and all the leftmost pieces are calculated:

$$SC(i) = \{SC_{1,i}, SC_{2,i}, \ldots, SC_{\alpha,i}\}$$

where i represents the i-th rightmost piece, $SC_{j,i}$ represents the matching score between the i-th rightmost piece and the j-th leftmost piece, and $\alpha$ represents the total number of leftmost pieces. $SC_{j,i}$ is expressed by the cumulative value of the correlation of the block in the leftmost piece and the block in the rightmost piece:

$$SC_{j,i} = \sum_{k=1}^{n-1} P(k+1, k)$$

where n represents the number of text lines (number of blocks) in a piece and $P(k+1, k)$ represents the degree of correlation between the block in the k-th text line of the rightmost piece and the block in the (k+1)-th text line of the leftmost piece. Subsequently, the leftmost piece $L_i$ with the highest matching score to $R_i$ is found; thus, $L_i$ is a leftmost piece that came from the same page as $R_i$:

$$L_i = \arg\max_{1 \le j \le \alpha} SC_{j,i}$$

where $L_i$ represents the index number corresponding to the leftmost piece. The above steps are repeated until all the rightmost and leftmost pieces are paired; then, the entire matching algorithm ends.
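The pairing procedure can be illustrated with a short Python sketch. Several things here are assumptions rather than details taken from the paper: the correlation values in `P` are invented placeholders (they are not the values of Table 1), block types I to IV are encoded as the integers 1 to 4, each paired leftmost piece is removed from further consideration, and the names `matching_score` and `pair_pieces` are hypothetical.

```python
from typing import Dict, List, Tuple

# Block types: 1 = blank, 2 = text, 3 = period, 4 = comma (types I-IV).
# P[(i, j)]: correlation between a type-i block on line k of the rightmost
# piece and a type-j block on line k+1 of the leftmost piece.
# Placeholder values for illustration only, NOT the values of Table 1.
P: Dict[Tuple[int, int], float] = {
    (1, 1): 0.9, (1, 2): 0.1,
    (2, 1): 0.05, (2, 2): 0.95,
    (3, 1): 0.4, (3, 2): 0.6,
    (4, 1): 0.0, (4, 2): 1.0,
}


def matching_score(right: List[int], left: List[int]) -> float:
    """Accumulate P over line k of the right piece and line k+1 of the left."""
    return sum(P[(r, l)] for r, l in zip(right[:-1], left[1:]))


def pair_pieces(rights: Dict[str, List[int]],
                lefts: Dict[str, List[int]]) -> Dict[str, str]:
    """Pair each rightmost piece with its best-scoring unpaired leftmost piece.

    Assumes there are at least as many leftmost as rightmost pieces.
    """
    remaining = dict(lefts)
    pairs: Dict[str, str] = {}
    for r_id, r_blocks in rights.items():
        best = max(remaining,
                   key=lambda l_id: matching_score(r_blocks, remaining[l_id]))
        pairs[r_id] = best
        del remaining[best]  # assumption: a leftmost piece is paired only once
    return pairs


rights = {"R1": [4, 1, 2, 3], "R2": [2, 2, 1, 2]}
lefts = {"L1": [2, 2, 1, 2], "L2": [1, 2, 2, 1]}
print(pair_pieces(rights, lefts))
```

Because the pairing is greedy, the result can depend on the order in which rightmost pieces are visited; the text selects them arbitrarily.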

Piece Clustering Based on the Regional Division
As a carrier of characters, a document can be divided into several paragraphs according to the content hierarchy, and explicit boundary markers occur between the different paragraphs [24,25]. Different documents lead to different paragraph layouts because of the diverse content [26]. However, in the same document, the paragraph layout in different regions is correlated because of the constraints of the writing format. As a derivative of the document, the piece also has the corresponding attribute of the document; therefore, the layouts of pieces from different pages are different, while the layouts of pieces from the same page are relevant.
Based on the above analysis, a page of a shredded document is divided into three areas, as shown in Figure 7. The beginning of each paragraph in the document is area 1; the end of each paragraph in the document is area 2; and the middle region of the document is area 3. The contents marked by black represent text or punctuation, and the contents marked by white represent blank spaces. The red line $L_a$ indicates the leftmost piece, the red line $L_b$ indicates the critical piece between area 1 and area 3, the red line $L_c$ indicates the critical piece between area 3 and area 2, and the red line $L_d$ indicates the rightmost piece.
For real pieces, the range of the three areas in a document is not fixed because the paragraph layouts of different documents are not the same, which means that the positions of $L_b$ and $L_c$ vary among different documents. As shown in Figure 8, to clearly divide the regions where the pieces belong, we divide the pieces (excluding the leftmost and rightmost pieces) into dense pieces and sparse pieces with a blank line ratio $\beta$ because the character distribution in a document appears to be "dense in the middle, sparse on both sides" (because of the presence of blanks at the beginning and end of a paragraph).

$$\beta = \frac{n}{m}$$

where m represents the total number of blocks along the vertical direction of a piece (the total number of text lines in a piece) and n represents the total number of blank blocks in the vertical direction of a piece (the total number of blank lines in a piece). A shred with a value of $\beta$ less than 0.15 is defined as a dense piece, and a shred with a value of $\beta$ greater than or equal to 0.15 is defined as a sparse piece.
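The dense/sparse split can be written in a few lines of Python. The 0.15 threshold comes from the text; the integer encoding of a blank block as 1 and the function names are assumptions made for this sketch.

```python
from typing import Sequence

BLANK = 1              # assumed encoding of a blank (type I) block
BETA_THRESHOLD = 0.15  # threshold stated in the text


def blank_line_ratio(blocks: Sequence[int]) -> float:
    """beta = n / m: blank blocks divided by all blocks of a piece."""
    if not blocks:
        raise ValueError("piece has no blocks")
    n_blank = sum(1 for b in blocks if b == BLANK)
    return n_blank / len(blocks)


def density(blocks: Sequence[int]) -> str:
    """Label a piece 'dense' (beta < 0.15) or 'sparse' (beta >= 0.15)."""
    return "dense" if blank_line_ratio(blocks) < BETA_THRESHOLD else "sparse"


# A piece with 1 blank line out of 20 versus one with 5 out of 20.
print(density([2] * 19 + [1]), density([2] * 15 + [1] * 5))
```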
In general, the dense pieces are located in area 3 of a document, while the sparse pieces are located in areas 1 and 2. The sparse pieces associated with $L_a$ are located in area 1, and the sparse pieces associated with $L_d$ are located in area 2. Because $L_b$ and $L_c$ are in critical positions, both dense and sparse pieces can occur. In this paper, $L_b$ is a dense piece and $L_c$ is a sparse piece. Note that the numbers of dense and sparse pieces differ among pages.
In this paper, pieces are clustered based on regional divisions. The algorithm is composed of three parts: piece clustering in area 1, piece clustering in area 2, and piece clustering in area 3. The system flowchart is shown in Figure 9.

Piece Clustering in Area 1
From the layout, the pieces in area 1 are located on the left side of the document. Because area 1 is affected by the indentation, its horizontal range is narrow; therefore, the area contains few shreds. In addition, these pieces are closely related because the shredded characters in the horizontal direction are correlated.
Based on the above analysis, we use the leftmost piece $L_a$, obtained in Section 2.2, as the starting point of clustering in area 1 and perform piece matching from left to right with the basic matching algorithm proposed in reference [20], which measures the matching degree of pieces by the number of mismatched combinations and the relevance between pieces. As shown in Figure 10, $L_a$ is the starting point, and the pieces on its right side are agglomerated gradually.
The clustering operations proceed as follows. First, one shred $L_{ai}$ (i.e., the leftmost piece in the i-th page of the document) is chosen randomly from all the leftmost pieces obtained in Section 2.2 and is used as the starting point. Second, by applying the basic matching algorithm (proposed in [19]) from left to right, a piece that matches $L_{ai}$ is found in the set S, which includes all the dense and sparse pieces to be tested. Third, the two matched pieces are regarded as a whole, and the basic matching algorithm is used to match them with other pieces. These steps are repeated until two dense pieces are matched consecutively, at which point the assembly process that begins with $L_{ai}$ is complete. We use the second matched dense piece as the critical piece $L_{bi}$. The assembly process is then carried out in the same way for each of the other leftmost pieces. When all the assembly processes are complete, piece clustering in area 1 terminates.
As this matching process proceeds, the number of pieces in set S is gradually reduced. In addition, to account for the influence of document skew on the blank and character distribution in the shreds (documents are rarely fed into a shredder perfectly straight), two consecutive dense pieces are set as the clustering end condition in this paper, and the second dense piece (rather than the first) is set as the critical shred between area 1 and area 3.
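The assembly loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `assemble_area1`, the dictionary representation of set S (piece id mapped to a dense/sparse flag), and the `match_score` callback that stands in for the basic matching algorithm of the cited reference are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def assemble_area1(
    start: str,
    pool: Dict[str, bool],
    match_score: Callable[[List[str], str], float],
) -> Tuple[List[str], Optional[str]]:
    """Greedy left-to-right assembly starting from one leftmost piece.

    start       -- id of the leftmost piece (hypothetical identifier)
    pool        -- set S as {piece id: is_dense}; it is reduced in place
    match_score -- stand-in for the basic matching algorithm (assumed API):
                   scores how well a candidate continues the current chain
    Returns the assembled chain and the critical piece (second of two
    consecutive dense pieces), or None if the end condition is never met.
    """
    chain = [start]
    dense_run = 0  # number of consecutive dense pieces just matched
    while pool:
        # Treat the chain as a whole and pick the best-matching candidate.
        best = max(pool, key=lambda pid: match_score(chain, pid))
        is_dense = pool.pop(best)  # S shrinks as pieces are absorbed
        chain.append(best)
        dense_run = dense_run + 1 if is_dense else 0
        if dense_run == 2:
            return chain, best  # second consecutive dense piece
    return chain, None
```

Requiring two consecutive dense pieces, rather than stopping at the first one, mirrors the skew tolerance described in the text: a single dense piece may appear early when the page was fed into the shredder at an angle.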

Piece Clustering in Area 2
The pieces in area 2 are located on the right side of an entire document. Because the usual Chinese document typography uses a horizontal arrangement in which words start from the left side [27], area 2 mainly reflects the layout of the ends of paragraphs. As shown in Figure 11, three types of character (words and punctuation) and blank distributions are observed in the horizontal direction in area 2. The first type is "from character to character", which means that the paragraph does not end or ends exactly at the rightmost area of a document; therefore, the entire line in area 2 consists of characters (see the regions surrounded by a red border in Figure 11). The second type is "from blank to blank", which means that the paragraph has ended in the preceding area; therefore, the entire region in area 2 is blank (see the regions surrounded by a blue border in Figure 11). The third type is "from character to blank", which means that the paragraph ends in area 2 and the left side of the line is a character; therefore, the right side is blank (see the regions surrounded by a green border in Figure 11).
In Figure 11, because the position of the end of a paragraph is indeterminate, we use the rightmost part of area 2 in a document as the starting point, and from right to left, the blanks may not be continuous, although the characters must be continuous. Therefore, for the shreds in area 2, this paper proposes a clustering algorithm based on the line position of characters (LPC algorithm). The flowchart of the LPC algorithm is shown in Figure 12, and each step of the algorithm is described in detail as follows.

LPC (line position of characters) Algorithm:
Step 1: After the processing in Section 2.3.1, the sparse pieces remaining in set S are taken as the shreds to be tested, and these pieces make up a set X.
Step 2: Using the method proposed in [28], all the rightmost pieces and the pieces in set X are transformed into the corresponding binary code sequences, in which a character block in the piece is represented by 1 and a blank block is represented by 0.
Step 3: According to the results of Section 2.2, a rightmost piece $L_{di}$ (i.e., the rightmost piece of the i-th page) corresponding to the leftmost piece $L_{ai}$ (i.e., the leftmost piece of the i-th page) is randomly selected as the starting point of the cluster.
Step 4: The line positions of all 1s in $L_{di}$ are recorded, and a piece $X_j$ in the set X is randomly selected. $X_j$ and $L_{di}$ are compared line by line from top to bottom. If every line with a 1 in $L_{di}$ also has a 1 in $X_j$, then $X_j$ and $L_{di}$ belong to the same cluster; if any line with a 1 in $L_{di}$ has a 0 in $X_j$, then they do not belong to the same cluster.
Step 5: Step 4 is repeated until all pieces in the set X that satisfy this condition, and therefore come from the same cluster as $L_{di}$, have been classified.
Step 6: The pieces that have been grouped with $L_{di}$ are defined as a set Y (Y ⊂ X). A piece $Y_j$ is randomly selected from Y, and a bitwise XOR is applied to the binary code sequences of $Y_j$ and $L_{di}$. The bits of the result are then summed, and the sum expresses the degree of difference between the two pieces.
Step 7: Step 6 is repeated until every piece in set Y has been XORed with $L_{di}$. The piece with the greatest degree of difference is identified; it has the largest layout difference from $L_{di}$. This piece is the critical piece $L_{ci}$ between area 2 and area 3 in the i-th page.
Step 8: Steps 3 to 7 are repeated until the clustering that begins with each rightmost piece is complete; the clustering algorithm in area 2 then ends.
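Steps 4 to 7 reduce to two operations on binary code sequences, a containment test and an XOR sum, which the following sketch illustrates for a single rightmost piece. The function names and the dictionary layout of set X are hypothetical, and the binary codes are assumed to be equal-length lists of 0s and 1s, one entry per text line.

```python
from typing import Dict, List, Optional, Tuple


def contains_all_character_lines(ref: List[int], cand: List[int]) -> bool:
    """Step 4: every line that holds a 1 in ref must also hold a 1 in cand."""
    return all(c == 1 for r, c in zip(ref, cand) if r == 1)


def xor_difference(a: List[int], b: List[int]) -> int:
    """Step 6: sum of the bitwise XOR, i.e. the number of differing lines."""
    return sum(x ^ y for x, y in zip(a, b))


def lpc_cluster(
    rightmost: List[int], X: Dict[str, List[int]]
) -> Tuple[List[str], Optional[str]]:
    """Group the pieces of X with one rightmost piece (Steps 4-7).

    Returns the ids in set Y and the critical piece, i.e. the member of Y
    whose layout differs most from the rightmost piece.
    """
    Y = {pid: code for pid, code in X.items()
         if contains_all_character_lines(rightmost, code)}
    if not Y:
        return [], None
    critical = max(Y, key=lambda pid: xor_difference(Y[pid], rightmost))
    return list(Y), critical
```

One design consequence worth noting: a rightmost piece that contains no character blocks at all would pass the containment test against every candidate, so in practice such a piece carries no discriminating information for this step.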

It should be noted that the number of character blocks contained in the sparse pieces clustered with $L_d$ is greater than or equal to the number of character blocks contained in $L_d$, as shown in Figure 13, because in area 2 of a document, the rightmost character distribution is the sparsest, and the sparseness gradually decreases as the position moves to the left.
In the set of rightmost pieces, the probability P that all character lines in a piece are contained by another piece is small, as shown in Equation (8). Therefore, the pieces from different pages of documents in area 2 are unlikely to be misclassified, and the clustering method using the line position of characters is reliable.
In Equation (8), $i$ is the total number of blocks in a piece, $j$ is the number of character blocks in a piece, with $i, j \neq 0$ and $i > j$, and $C_i^j$ is the number of combinations of $j$ blocks chosen from $i$.

Piece Clustering in Area 3
For area 3 of a document, the left and right boundary pieces are $L_b$ and $L_c$, respectively, and these pieces on the same page can be obtained from Sections 2.3.1 and 2.3.2. From the view of boundary shreds, the pieces in area 3 can be divided into two cases. In the first case, the boundary pieces $L_b$ and $L_c$ have blank blocks in the same line, as shown in Figure 14a; in the second case, the boundary pieces $L_b$ and $L_c$ do not have blank blocks in the same line, as shown in Figure 14b. Based on the above analysis, this paper proposes a double-pronged attack strategy and a gradual agglomeration strategy to achieve clustering.
The double-pronged attack strategy is applied to the case in Figure 14a. Because the text and punctuation in a paragraph are continuous in the horizontal direction instead of intermittent [29], when the two boundary pieces $L_b$ and $L_c$ from the same page have one or more blank blocks in the same line, the paragraphs in these line positions have ended; therefore, all pieces in area 3 of this page also have blank blocks in these line positions. If a piece has blank blocks in the same line positions as both boundary shreds, then it is in the same cluster as the two boundary shreds; otherwise, it is not.
The gradual agglomeration strategy is applied to the case in Figure 14b. Because the end of a paragraph is random, the pieces in area 3 and both boundary shreds lack uniform distributions of characters and blanks. The interrelationships of the text structures (blocks) in adjacent pieces are used to gradually match the pieces of area 3 with the boundary shred. Using the basic matching algorithm proposed in reference [20], the boundary shred $L_c$ is taken as a starting point; then, the pieces in area 3 are gradually absorbed into the cluster via matching from right to left. When the matching degree of pieces reaches a threshold, the clustering is complete.
The following steps constitute the clustering algorithm, which uses the blanks on the same line to realize piece clustering in area 3. We refer to this algorithm as the "blanks on the same line" (BSL) algorithm, and the flowchart of the BSL algorithm is shown in Figure 15.

BSL (blanks on the same line) Algorithm:
Step 1: Dense pieces (excluding the dense pieces that have been clustered in Section 2.3.1) are considered the shreds to be tested, and these shreds form a set Z.
Step 2: All boundary pieces obtained from Sections 2.3.1 and 2.3.2 constitute the set $L_1$:
$$L_1 = \{(L_{b1}, L_{c1}), (L_{b2}, L_{c2}), \ldots, (L_{bn}, L_{cn})\}$$
where $(L_{bi}, L_{ci})$ represents a pair of boundary pieces in area 3 of the i-th page, $L_{bi}$ is the left boundary piece, $L_{ci}$ is the right boundary piece, $i \in \{1, 2, \ldots, n\}$, and $n$ is the number of pages.
Step 3: Each pair of boundary pieces in set $L_1$ is traversed in the vertical direction to identify the blank blocks with the same line positions, which we call BSLs. Then, the number and positions of the BSLs in each pair of boundary pieces are recorded.
Step 4: The pairs of boundary pieces in the set $L_1$ are arranged in descending order according to the number of BSLs in each pair.
Step 5: The pairs of boundary pieces that have the largest number of BSLs are extracted from $L_1$; namely, there are $k$ ($k \neq 0$) pairs of boundary pieces that contain the largest number of BSLs, and they are used as left-right reference pieces. Subsequently, the pieces in set Z are clustered by the reference grouping.
Step 6: Step 5 is repeated until all pairs of boundary pieces in set $L_1$ are processed, and when $L_1$ is empty, the entire algorithm ends.
In the above algorithm, the reference grouping is an important part of realizing clustering, and its flowchart is shown in Figure 16. When the pieces in set Z are clustered, the number of pairs of left-right reference pieces (L-R reference pieces), i.e., $k$, needs to be determined first. If $k = 1$, then the shreds are clustered using the double-pronged attack strategy (DA Strategy). If $k \neq 1$, then there are several pairs of L-R reference pieces, and according to the relationship of the blanks on the same line position (BSLP) in different pairs of L-R reference pieces, we divide the pairs into three cases. In the first case, the BSLPs in the $k$ pairs of left-right reference pieces are different. We take each pair of left-right reference pieces as the reference and use the DA Strategy to group the shreds. In the second case, the BSLPs in the $k$ pairs of left-right reference pieces are identical. We take each pair of left-right reference pieces as the reference and use the gradual agglomeration strategy (GA Strategy) to group the shreds. In the third case, among the $k$ pairs of left-right reference pieces, the BSLPs in $u$ pairs are different and the BSLPs in $v$ pairs are identical, where $u + v = k$. First, we take the $u$ pairs of left-right reference pieces as the reference and use the DA Strategy to group the shreds. Second, we take the $v$ pairs of left-right reference pieces as the reference and use the GA Strategy to group the shreds. It should be noted that if a BSL is not observed in a pair of left-right reference pieces, we use the GA Strategy to group the shreds.
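The BSL bookkeeping (Steps 3 and 4) and the membership test of the DA Strategy can be sketched as below. This is an illustrative sketch only: the function names, the page labels, and the encoding of each piece as a list with 1 for a character block and 0 for a blank block are assumptions, and the GA Strategy is not shown because it depends on the matching algorithm of the cited reference.

```python
from typing import Dict, List, Tuple

Code = List[int]  # one entry per text line: 1 = character block, 0 = blank block


def bsl_positions(left: Code, right: Code) -> List[int]:
    """Line positions where both boundary pieces have a blank block (BSLs)."""
    return [k for k, (l, r) in enumerate(zip(left, right)) if l == 0 and r == 0]


def order_pairs_by_bsl(pairs: Dict[str, Tuple[Code, Code]]) -> List[Tuple[str, int]]:
    """Step 4: boundary pairs sorted by their number of BSLs, largest first."""
    counts = {page: len(bsl_positions(l, r)) for page, (l, r) in pairs.items()}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


def da_members(pair: Tuple[Code, Code], Z: Dict[str, Code]) -> List[str]:
    """DA Strategy test: a piece joins the pair's cluster only if it is blank
    on every BSL line of the pair. With no BSL the test is not applicable
    (the text falls back to the GA Strategy), so an empty list is returned."""
    lines = bsl_positions(*pair)
    if not lines:
        return []
    return [pid for pid, code in Z.items() if all(code[k] == 0 for k in lines)]
```

Processing the pairs with the most BSLs first is the more selective order: the more shared blank lines a pair has, the fewer unrelated dense pieces can satisfy the test by coincidence.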

Experimental Results and Discussion
The method proposed in this paper is tested with real pieces. Ten pages from a long file are randomly selected as the original dataset. In the dataset, the paper size is A4, the paper color is white, and the paper is plain rather than squared. All documents are edited using Microsoft Office Word (Microsoft Corporation, Redmond, WA, USA) following a unified format: the font is Song style, the character color is black, the character size is small four, and the line spacing is 1.5. All documents are shredded by a Sunwood ST9290 shredder (Sunwood Holding Group Co., Ltd., Yuhuan, Zhejiang, China), and 462 pieces are produced in total (excluding the blank shreds); each piece has a width of 3 mm. The experiment is executed on a computer (Mingsu-U2, Ningdong Electronic Technology Co., Ltd, Guangzhou, Guangdong, China) with an Intel Core 2 3.0 GHz CPU, 4 GB memory, and a 500 GB hard disk.
In the original dataset (dataset S1 in the supplementary material), the 10 pages are designated A to J, and all shreds in each page are numbered in sequence; for example, the original index numbers of the pieces in document A range from A1 to A46. In the actual test, because shreds from different pages are mixed together and the shred sequences are disrupted, the pieces are renumbered from 1 to 462 to constitute the test dataset. In the experimental process, the test index numbers of shreds are visible, and the original index numbers of shreds are invisible.

Clustering Number Results
The clustering number of the test dataset can be obtained by the process described in Section 2.1. The method for calculating the clustering number must first identify all the leftmost and rightmost pieces in the dataset. The value of $Q_{14}$ in Formula (1) is the basis for judging whether a shred is the leftmost piece, and the value of $Q_{15}$ in Formula (2) is the basis for judging whether a shred is the rightmost piece. Based on the detection results for 100 pages of documents in the experiment, including 4668 shreds, we set the dual thresholds of $Q_{14}$ to $Q_{th1} = 0.9$ and $Q_{th2} = 0.85$, and apply dual thresholds of 0.9 and 0.85 to $Q_{15}$ in the same way. The identification results of the leftmost and rightmost pieces in the test dataset are shown in Figures 17 and 18, respectively.
Figure 17 shows the identification results for the leftmost pieces. The $Q_{14}$ values of different shreds in Figure 17 indicate that there are 10 shreds in the range $[Q_{th1}, 1]$ but none in the range $[Q_{th2}, Q_{th1})$, which means that the number of leftmost pieces is 10 and can be obtained without manual assistance. Figure 18 shows the identification results for the rightmost pieces. By the same analysis, the number of rightmost pieces is 10. A comparison with the original index numbers of shreds confirms that these 20 shreds are the true leftmost and rightmost pieces. Therefore, although the identification of actual shreds is affected by noise interference at the edge of a piece and by classification errors, the method proposed in Section 2.1 can effectively identify the leftmost and rightmost pieces. Based on the number of leftmost and rightmost pieces, the clustering number of shreds in the test dataset is calculated as 10.

Results for the Starting Points of Clusters
For the leftmost and rightmost pieces obtained in Section 2.1, we use the method proposed in Section 2.2 to calculate the matching scores between each rightmost piece and all leftmost pieces. The results are shown in Figure 19. The matching score between the rightmost and leftmost pieces from the same page is greater than the matching score between the rightmost and leftmost pieces from different pages. Although the matching scores for several pages are not high (the loss of character information and the misjudgment of a few blocks caused by noise affect the matching scores between shreds), as shown in Figure 19e, the final matching result is not impeded. Because the misjudged blocks are in the minority, the matching scores between the rightmost and leftmost pieces from the same page are clearly higher than the matching scores of other shreds.
To clarify the matching relationship of the rightmost and leftmost pieces, we use the original index number instead of the test index number to mark each piece.

Results of Piece Clustering Based on Regional Divisions
Based on the pairing of rightmost pieces with leftmost pieces described in Section 2.2, we adopt the method proposed in Section 2.3 to cluster the shreds in the test dataset. Figure 20a-c shows the clustering results for area 1, area 2, and area 3, respectively.
As shown in Figure 20, the number of shreds in each cluster increases in a stepwise fashion during clustering. The number of shreds in each cluster in area 1 is low, as shown in Figure 20a, while the number of shreds in each cluster in area 2 is higher, as shown in Figure 20b. This distribution is consistent with the actual layout of the document. In addition, no misclassified shreds were generated during these two parts of the clustering process, which shows that the method is effective. One misclassified shred occurred in Section 2.3.3, as shown in Figure 20c, and the cause of this misclassification is shown in Figure 21. Because the misclassified shred (original index number E11) contains only a small fraction of a comma in the 17th line (the comma is split across two pieces), the block that should contain punctuation is judged to be a blank block. As a result of this misjudgment, shred E11 and the dense pieces in cluster J have blanks on the same line. Therefore, when the dense pieces in cluster J are clustered under the DA Strategy, shred E11 is incorrectly classified into cluster J.
However, 56 shreds (12.12% of the total) remain unclassified, as shown in Figure 20c, which illustrates that under conditions with various real shreds, the method proposed in this paper has certain deficiencies; thus, further improvements must be made to classify the residual shreds.

Treatment of Residual Shreds
The residual shreds are composed of 44 sparse pieces and 12 dense pieces. First, we analyze the sparse pieces, which form the majority, and find that they were not clustered because of block-type misjudgments, caused primarily by noise at the edge of a shred and by shreds that contain only a small part of a word or a punctuation mark (a result of shredder slicing), as shown in Figure 22.

Although differences may occur in the layout of characters and blanks between different shreds, neighboring shreds from the same page usually present a similar layout [28]. Therefore, even if a few blocks in a shred are misjudged, the layouts of neighboring shreds from the same page are still closely related. Based on the above analysis, we use the total number of blocks of the same type between two shreds to assess the residual sparse pieces. As shown in Figure 23, a line-by-line comparison of the blocks between two shreds is executed along the vertical direction. If the types of the two blocks are the same, then the line is marked as $s$; otherwise, it is marked as $d$. The sum of $s$ is the total number of blocks of the same type, $TS$, and $TS$ reflects the neighborhood degree of two shreds.
The specific process of residual sparse piece clustering is as follows. A sparse piece $dp_i$ is selected randomly from the set of residual sparse pieces $DP$:
$$DP = \{dp_1, dp_2, \ldots, dp_i, \ldots, dp_\alpha\}$$
where $dp_i$ represents the i-th residual sparse piece and $\alpha$ indicates the total number of residual sparse pieces. The total number of blocks of the same type between $dp_i$ and each shred that has been classified is calculated separately, and the calculation results form a set $TS(i)$:
$$TS(i) = \{TS_{i,1}, TS_{i,2}, \ldots, TS_{i,j}, \ldots, TS_{i,\gamma}\}$$
where $TS_{i,j}$ represents the total number of blocks of the same type between the i-th residual sparse piece and the j-th shred that has been classified, and $\gamma$ indicates the total number of shreds that have been classified. Then, we search for the shred $w$ that shares the highest total number of blocks of the same type with $dp_i$:
$$w = \arg\max_{j} TS_{i,j}$$
where $w$ represents the test index number of the corresponding shred.
If the value of $w$ is unique, then $dp_i$ and $w$ are considered to be in the same cluster, and $dp_i$ is incorporated into the same cluster as $w$; however, if $w$ has $x$ values ($x \geq 2$), then the set $L_m$ must be evaluated. $L_m$ consists of the $x$ candidate shreds:
$$L_m = \{L_{m1}, L_{m2}, \ldots, L_{mx}\}$$
where $L_{mx}$ represents the x-th candidate shred. If all shreds in $L_m$ are from the same cluster, then $dp_i$ is incorporated into that cluster, whereas if the shreds in $L_m$ are from different clusters, then $dp_i$ is marked as an unclassifiable shred.
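The assignment rule for one residual sparse piece, including the tie handling, can be sketched as follows. The names `same_type_count` and `assign_residual` and the dictionary layout of the classified shreds are hypothetical, and each shred is assumed to be encoded as an equal-length list with 1 for a character block and 0 for a blank block.

```python
from typing import Dict, List, Optional, Tuple


def same_type_count(a: List[int], b: List[int]) -> int:
    """TS: number of line positions where both shreds hold the same block type."""
    return sum(1 for x, y in zip(a, b) if x == y)


def assign_residual(
    dp: List[int], classified: Dict[str, Tuple[List[int], str]]
) -> Optional[str]:
    """Return the cluster label for one residual sparse piece, or None.

    classified maps shred id -> (binary code, cluster label).
    The piece joins the cluster of the shred(s) with the highest TS; if the
    tied candidates come from different clusters, it stays unclassifiable.
    """
    if not classified:
        return None
    scores = {sid: same_type_count(dp, code)
              for sid, (code, _) in classified.items()}
    best = max(scores.values())
    candidates = [sid for sid, s in scores.items() if s == best]  # the set L_m
    clusters = {classified[sid][1] for sid in candidates}
    return clusters.pop() if len(clusters) == 1 else None
```

Refusing to assign a piece when the tied candidates disagree trades coverage for precision: a piece left unclassified can still be handled later, whereas a wrong assignment would contaminate a cluster.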
The same method is used to evaluate the other residual sparse pieces until all residual sparse pieces have been processed. Then, the algorithm is complete.
Experiments demonstrate that these improvements are effective. Figure 24 presents the results of processing residual sparse pieces. All residual sparse pieces are incorporated into the correct clusters, and the shreds that are not yet classified are dense pieces. These findings illustrate that the improvements fully exploit the relevance between neighboring pieces, correct for the negative effect of a few blocks misjudged in the original method, and further improve the accuracy of piece clustering.
Second, the residual dense pieces are not classified because certain factors, such as information loss at the edge of a shred and classifier error, affect these shreds when they are clustered according to the GA Strategy. Because of these effects, the shreds are unable to meet the match conditions; thus, they remain unclassified. In addition, because there are many characters and few blanks in a dense piece, the dense pieces from different pages often have the same or similar layouts. Therefore, the method of processing residual sparse pieces does not effectively manage these shreds. Ultimately, 12 dense pieces failed to cluster.

Summary
After performing the processing described in Sections 3.1 to 3.4, the clustering results in the test dataset are obtained, as shown in Table 2. The average accuracy of clustering is 97.19% (449/462), 12 shreds are not classified, and one shred is misclassified. These results show that the clustering method proposed in this paper has a high accuracy and a low error rate. Moreover, for homologous pieces that appear indistinguishable, the proposed method can fully exploit their internal relationships and differences to achieve effective clustering. Although shred processing in area 3 of a document has shortcomings, considering the complexity of real shreds, which are affected by noise interference, information loss, and other factors, the clustering results in Table 2 are satisfactory.

Additionally, from the point of view of time complexity, the complexity of our algorithm in its various stages is not high, with the exception of the process of Section 2.1, and the time consumption of the algorithm is acceptable. The time complexity is $O(n^3)$ in Section 2.1 (process of computing the clustering number) because the training and testing of the classifier for recognizing five different types of blocks in shreds is very time-consuming. The time complexity is $O(n^2)$ in Section 2.2 (process of identifying the starting point of clusters). Since the number of shreds that are processed in this stage is substantially less than the total number of shreds, the time consumption in this stage is minimal. The time complexity is $O(n^2)$ in Section 2.3 (process of achieving piece clustering based on the regional division). In this stage, the complexity of the piece clustering in area 1 and area 3 is greater than the complexity of the piece clustering in area 2.
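As a quick check of the reported accuracy, the figure can be recomputed from the counts given in the text (462 pieces, 12 unclassified, one misclassified). The variable names below are illustrative only.

```python
total = 462          # pieces in the 10-page test dataset
unclassified = 12    # dense pieces that failed to cluster
misclassified = 1    # pieces placed in the wrong cluster

correct = total - unclassified - misclassified
accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.2%}")  # prints 449/462 = 97.19%
```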
of other pieces. Additionally, we fully mine the document properties of the pieces, and the pieces in different areas are distinguished and associated effectively based on the paragraph layout features. Meanwhile, according to the similarity of adjacent pieces in layout, the interference of individual blocks is suppressed in clustering by using the total number of the same types of blocks. The clustering effect is clearly improved.

From the evaluation results, the purity of our method is very high, whereas the silhouette coefficient is not very high. This is because the paragraph layouts in different areas of a document are different, and the paragraph layouts in the same area of a document are similar. Accordingly, for a cluster of pieces from a document, there is a high similarity between pieces in the same area, and the similarity between pieces in different areas is not high. The silhouette coefficient uses the similarity of pieces as an important basis for evaluating clustering results over an entire cluster, and it does not adequately reflect the similarity of homologous pieces within a small range. Thus, the silhouette coefficient of our method is not very high.

For k-means clustering, the algorithm uses only the layout similarity of characters and blanks to cluster, and it ignores the differences and correlations between pieces in different areas of a page. Thus, pieces from different pages are clustered together because of their similar layouts, and the separation between different clusters is not high. Moreover, the randomly selected initial cluster centers also influence the clustering. The clustering effect of the k-means algorithm is not good: its purity is 67.72%, and its silhouette coefficient is 0.4158.

For hierarchical clustering, although its purity and silhouette coefficient are higher than those of k-means, it realizes clustering based only on the layout similarity between pieces, and it does not consider the differences and correlations between pieces from the perspective of the overall document layout, which means that some pieces that came from the same page are divided into different clusters because of their different layouts. Therefore, the clustering effect of the hierarchical algorithm is not satisfactory.
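For readers unfamiliar with the two evaluation measures discussed above, the following is a minimal sketch of how purity and the silhouette coefficient are commonly computed. It is not the evaluation code used in the paper: the Hamming distance between block-code sequences is an assumed stand-in for the paper's similarity measure, and all function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def purity(true_pages: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Fraction of pieces that belong to the majority page of their predicted cluster."""
    by_cluster = {}
    for page, cluster in zip(true_pages, predicted):
        by_cluster.setdefault(cluster, Counter())[page] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(true_pages)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two block-code sequences differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def silhouette(points: Sequence[Sequence[int]], labels: Sequence[Hashable]) -> float:
    """Mean silhouette coefficient over all pieces."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    n = len(points)
    scores = []
    for i in range(n):
        # Mean distance from piece i to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            d = [hamming(points[i], points[j])
                 for j in range(n) if j != i and labels[j] == c]
            if d:
                mean_dist[c] = sum(d) / len(d)
        if labels[i] not in mean_dist:  # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist[labels[i]]                                     # cohesion
        b = min(v for c, v in mean_dist.items() if c != labels[i])   # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


print(purity(["A", "A", "B", "B"], [0, 0, 0, 1]))  # prints 0.75
```

Because the silhouette averages distances over a whole cluster, a cluster that correctly gathers one page's pieces from areas with different layouts has a large within-cluster distance, which is consistent with the explanation above for why a high-purity result can still score a modest silhouette.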

Conclusions
This paper presents a novel clustering method for Chinese homologous pieces that are difficult to distinguish visually. The pieces are clustered in three steps: computing the clustering number, identifying the starting point of clusters, and achieving piece clustering based on the regional division. In the step of computing the clustering number, based on the distribution features of characters in the piece, the leftmost and rightmost pieces in the documents are recognized, and the clustering number is calculated. In the step of identifying the starting point of clusters, the correlation of syntax in adjacent text lines is employed, and the leftmost piece and the rightmost piece that come from the same page are exactly matched. In the step of achieving piece clustering based on the regional division, according to the document layout, the pieces are distinguished in different areas of the document, and piece clustering is achieved through the correlation among pieces in different areas. The experimental results show that the proposed method can effectively cluster real pieces. Moreover, this method lays the foundation for solving the problem of homologous shredded document recovery.
It is worth mentioning that although this paper addresses shredded plain-text documents, the method also has application value for shredded documents with figures and images. In contrast to homologous pieces in a plain-text document (where the pieces are very similar in content and lack effective distinguishing features), shredded documents with figures and images contain easily extractable features that support clustering because the figures and images differ in size and position in real documents. However, a document is not a photograph; characters remain its primary content. The distribution of characters in different areas of a document remains dense and sparse, a correlation exists among characters in different areas, and the paragraph layout feature of a document remains in the pieces. Thus, we can also apply the method proposed in this paper to process shredded documents with figures and images. Because figures and images are added to a document, the total relevance of the characters of the pieces in the horizontal direction is weakened, which affects piece clustering to a certain extent. The features of figures and images can be used to easily distinguish the pieces, and they can promote

Figure 1. Example of homologous pieces. All the pieces in this figure are derived from the same file: (a) and (b) belong to the first page of the document; (c) and (d) belong to the second page of the document; (e) and (f) belong to the third page of the document.

Appl. Sci. 2017, 7, 951
using a classifier based on five types of graphical Chinese characters, as shown in Figure 3. Finally, the piece is represented as a digital sequence.

Figure 2. Piece divided into a series of blocks.

Figure 4. Leftmost piece, middle piece, rightmost piece, and their corresponding digital sequences: (a) leftmost piece and (d) corresponding digital sequence; (b) middle piece and (e) corresponding digital sequence; (c) rightmost piece and (f) corresponding digital sequence.

Figure 5. Example of the interrelationships among the leftmost piece and the rightmost piece on the same page: (a) leftmost piece of the document; (b) rightmost piece of the document. The green frame represents the space range of the block in each text line.
the leftmost piece is text, indicating that the content of the previous sentence continues. These two characters are closely related. Additionally, the block in the second text line of the rightmost piece is blank, indicating that the content of a paragraph has ended, and the block in the third text line of the leftmost piece is blank, indicating the indentation of the first text line at the beginning of a new paragraph. These two blanks are also closely related. The relationship between characters and blanks in adjacent horizontal text lines is similar.

Figure 6. Four types of blocks: (a) a type I block; (b) a type II block; (c) a type II block; (d) a type III block; (e) a type III block; (f) a type IV block; (g) a type IV block.
black represent text or punctuation, and the contents marked by white represent blank spaces. The red line $L_a$ indicates the leftmost piece, the red line $L_b$ indicates the critical piece between area 1 and area 3, the red line $L_c$ indicates the critical piece between area 3 and area 2, and the red line $L_d$ indicates the rightmost piece.

Figure 7. Page of a document divided into three areas.

Figure 8. Example of a dense piece and sparse piece: (a) dense piece; (b) sparse piece. The black area represents characters, and the white area represents blank spaces.

$L_d$ are located in area 2. Because $L_b$ and $L_c$ are in critical positions, both dense and sparse pieces can occur. In this paper, $L_b$ is a dense piece and c

Figure 9. System flowchart of the piece clustering algorithm.

Figure 11. Different distributions of characters and blanks in area 2.

are defined as a set $Y$ ($Y \subset X$). A piece $Y_j$ is randomly selected from $Y$, and a bitwise XOR operation is applied to the binary code sequences of $Y_j$ and $L_d^i$. Finally, the result of the bitwise operation is summed, and the sum expresses the difference degree of the two pieces.
Step 7: Step 6 is repeated until all pieces in set $Y$ have been XORed with $L_d^i$. The piece with the greatest difference degree is identified; it has the largest difference in layout from $L_d^i$. This piece is the critical piece $L_c^i$ between area 2 and area 3 on the i-th page.
Step 8: Repeat Step 3 to Step 7 until all clustering beginning with the rightmost piece is completed; subsequently, the clustering algorithm in area 2 ends.
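The XOR comparison in Steps 6 and 7 can be illustrated with a short Python sketch. It assumes, purely for illustration, that each piece is encoded as a binary sequence with one bit per text line (1 for a character block, 0 for a blank); the names `difference_degree` and `find_critical_piece` are hypothetical and do not come from the paper.

```python
from typing import Dict, Sequence, Tuple


def difference_degree(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of the bitwise XOR of two equal-length binary code sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of text lines")
    return sum(x ^ y for x, y in zip(a, b))


def find_critical_piece(
    rightmost: Sequence[int],
    candidates: Dict[str, Sequence[int]],  # set Y: piece name -> binary code sequence
) -> Tuple[str, int]:
    """Return the candidate whose layout differs most from the rightmost piece."""
    # On a tie, max() keeps the first candidate in insertion order (a choice of this sketch).
    name = max(candidates, key=lambda k: difference_degree(rightmost, candidates[k]))
    return name, difference_degree(rightmost, candidates[name])


rightmost = [1, 0, 0, 1, 0]
candidates = {
    "y1": [1, 0, 0, 1, 1],
    "y2": [1, 1, 1, 1, 0],
    "y3": [0, 1, 1, 0, 1],
}
print(find_critical_piece(rightmost, candidates))  # prints ('y3', 5)
```

The source does not say how ties in the difference degree are resolved, so the first-wins behavior here should be treated as a placeholder.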

Figure 12. Flowchart of LPC (line position of characters) algorithm.

Figure 13. Comparison of the number of character blocks contained in the pieces of area 2. The red boxes mark the character blocks with the same line position in the different pieces.

Figure 14b. Based on the above analysis, this paper proposes a double-pronged attack strategy and a gradual agglomeration strategy to achieve clustering. In the double-pronged attack strategy, which is applied to the case in Figure 14a, because the text and punctuation in the paragraph are continuous in the horizontal direction instead of intermittent [29], when the two boundary pieces b

Figure 14. Distribution of pieces in area 3 of a document: (a) the boundary pieces $L_b$ and $L_c$ have blank blocks in the same line; (b) the boundary pieces $L_b$ and $L_c$ do not have blank blocks in the same line.

Figure 15. Flowchart of BSL (blanks on the same line) algorithm.
In the above algorithm, the reference grouping is an important part of realizing clustering, and its algorithm flowchart is shown in Figure 16. When the pieces in set $Z$ are clustered, the number of pairs of left-right reference pieces (L-R reference pieces), i.e., $k$, needs to be determined first. If $k = 1$, then the shreds are clustered using the double-pronged attack strategy (DA Strategy). If $k \neq 1$, then there are several pairs of L-R reference pieces, and according to the relationship of blanks on the same line position (BSLP) in different pairs of L-R reference pieces, we divide the pairs into three cases. For the first case, the BSLPs in the $k$ pairs of L-R reference pieces are different. We take each pair of L-R reference pieces as the reference and use the DA Strategy to group the shreds. For the second case, the BSLPs in the $k$ pairs of L-R reference pieces are identical. We take each pair of L-R reference pieces as the reference and use the gradual agglomeration strategy (GA Strategy) to group the shreds. For the third case, among the $k$ pairs of L-R reference pieces, the BSLPs in $u$ pairs are different, and the BSLPs in $v$ pairs are identical, where $u + v = k$. First, we take the $u$ pairs of L-R reference pieces as the reference and use the DA Strategy to group the shreds. Second, we take the $v$ pairs of L-R reference pieces as the reference and use the GA Strategy to group the shreds.
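The case analysis for reference grouping (one pair, all BSLPs different, all identical, or a mixture) can be summarized as a small dispatch routine. The sketch below is an interpretation rather than the authors' code: it assumes the BSLP of each L-R reference pair is represented as a set of line positions, and that a pair counts as "identical" when at least one other pair has exactly the same set. The function name `choose_strategies` is hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List


def choose_strategies(bslps: List[FrozenSet[int]]) -> List[str]:
    """Pick 'DA' or 'GA' for each L-R reference pair from its BSLP set.

    bslps[i] holds the line positions at which pair i has blanks on the same line
    (an assumed representation).
    """
    k = len(bslps)
    if k == 0:
        return []
    if k == 1:
        return ["DA"]  # a single pair always uses the double-pronged attack strategy
    counts = Counter(bslps)
    # A BSLP shared with another pair -> gradual agglomeration; a unique BSLP -> DA.
    return ["DA" if counts[b] == 1 else "GA" for b in bslps]


pairs = [frozenset({1}), frozenset({3}), frozenset({3})]
print(choose_strategies(pairs))  # prints ['DA', 'GA', 'GA']
```

In the mixed case the text applies the DA Strategy to the $u$ distinct pairs before the GA Strategy to the $v$ identical pairs; the returned list only labels each pair and leaves that ordering to the caller.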
is the rightmost piece. Therefore, based on the detection results for 100 pages of documents in the experiment, including 4668 shreds, we set the dual thresholds of results of the leftmost and rightmost pieces in the test dataset are shown in Figure 17 and Figure 18, respectively.

Figure 17. Identification results for the leftmost pieces. The ordinate represents the value of the proportion $Q_{14}$ in a shred and the abscissa represents the shreds in the test dataset.

Figure 17 shows the identification results for the leftmost pieces. The $Q_{14}$ values of different

Figure 18. Identification results for the rightmost pieces. The ordinate represents the value of the proportion $Q_{15}$ in a shred and the abscissa represents the shreds in the test dataset.

Figure 19. Matching results between each rightmost piece and all leftmost pieces: (a) the rightmost piece in document A and all leftmost pieces; (b) the rightmost piece in document B and all leftmost pieces; (c) the rightmost piece in document C and all leftmost pieces; (d) the rightmost piece in document D and all leftmost pieces; (e) the rightmost piece in document E and all leftmost pieces; (f) the rightmost piece in document F and all leftmost pieces; (g) the rightmost piece in document G and all leftmost pieces; (h) the rightmost piece in document H and all leftmost pieces; (i) the rightmost piece in document I and all leftmost pieces; (j) the rightmost piece in document J and all leftmost pieces.
indicate the clustering results of each stage of Section 2.3. Figure 20a represents the clustering results after Section 2.3.1 processing; Figure 20b represents the clustering results after Section 2.3.2 processing; and Figure 20c represents the clustering results after Section 2.3.3 processing. To clearly reflect the clustering results of each stage, we use a histogram to describe the piece clustering process in each page of the document.

Figure 20. Clustering results of each stage of Section 2.3: (a) clustering results after Section 2.3.1 processing; (b) clustering results after Section 2.3.2 processing; (c) clustering results after Section 2.3.3 processing.

Figure 21. Causes of misclassified shreds: (a) dense piece E10 adjacent to E11; (b) dense piece E11; (c) one of the dense pieces in cluster J.

Figure 22. Examples of a small part of a word or a punctuation mark in a shred: (a) a small part of a punctuation mark in a shred; (b) a small part of a word in a shred.

Figure 23. Total number of the same types of blocks between two shreds.

Figure 24. Results of processing residual sparse pieces.