Practical Challenge of Shredded Documents: Clustering of Chinese Homologous Pieces
Abstract: When recovering a shredded document that has numerous mixed pieces, the difficulty of the recovery process can be reduced by clustering, which is a method of grouping pieces that originally belonged to the same page. Restoring homologous shredded documents (pieces from different pages of the same file) is a frequent problem, and because these pieces have nearly indistinguishable visual characteristics, grouping them is extremely difficult. Clustering research has important practical significance for document recovery because homologous pieces are ubiquitous. Because of the wide usage of Chinese and the huge demand for Chinese shredded document recovery, our research focuses on Chinese homologous pieces. In this paper, we propose a method of completely clustering Chinese homologous pieces in which the distribution features of the characters in the pieces and the document layout are used to correlate adjacent pieces and cluster them in different areas of a document. The experimental results show that the proposed method has a good clustering effect on real pieces. For the dataset containing 10 page documents (a total of 462 pieces), its average accuracy is 97.19%.
Xing, N.; Zhang, J.; Cao, F.; Liu, P. Practical Challenge of Shredded Documents: Clustering of Chinese Homologous Pieces. Appl. Sci. 2017, 7, 951.
Xing N, Zhang J, Cao F, Liu P. Practical Challenge of Shredded Documents: Clustering of Chinese Homologous Pieces. Applied Sciences. 2017; 7(9):951.
Xing, Nan; Zhang, Jianqi; Cao, Furong; Liu, Pengfei. 2017. "Practical Challenge of Shredded Documents: Clustering of Chinese Homologous Pieces." Appl. Sci. 7, no. 9: 951.