A New Binarization Algorithm for Historical Documents

Monochromatic documents require much less network bandwidth and storage space than their color or even grayscale equivalents. The binarization of historical documents is far more complex than that of recent ones, as paper aging, color, texture, translucidity, stains, back-to-front interference, the kind and color of ink used in handwriting, the printing process, the digitalization process, etc. are some of the factors that affect binarization. This article presents a new binarization algorithm for historical documents. The new global filter proposed is performed in four steps: filtering the image using a bilateral filter, splitting the image into its RGB components, decision-making for each RGB channel based on an adaptive binarization method inspired by Otsu's method with a choice of the threshold level, and classification of the binarized images to decide which of the RGB components best preserved the document information in the foreground. The quantitative and qualitative assessment made with 23 binarization algorithms on three sets of "real world" documents showed very good results.


Introduction
Document image binarization plays an important role in the document image analysis, compression, transcription, and recognition pipeline [1]. Binary documents require far less storage space and network bandwidth than color or grayscale documents. Historical documents drastically increase the degree of difficulty for binarization algorithms. Physical noises [2] such as stains and paper aging affect the performance of binarization algorithms. Besides that, historical documents were often typed, printed or written on both sides of the sheets of paper, and the opacity of the paper is often such that the printing or writing on the back is visible on the front side. This kind of "noise", first called back-to-front interference [3], was later known as bleeding or show-through [4]. Figure 1 presents three examples of documents with such noise, extracted from the three different datasets used in this paper in the assessment of the proposed algorithm. If the document is exhibited either in true color or grayscale, the human brain is able to filter out that sort of noise, keeping its readability. The strength of the interference varies with the opacity of the paper, its permeability, the kind and degree of fluidity of the ink used, its storage, age, etc. Thus, the difficulty of obtaining a good binarization capable of filtering out such noise increases enormously, as a new set of hues of paper and printing colors appears. The direct application of binarization algorithms may yield a completely unreadable document, as the interfering ink from the back side of the paper overlaps with the binary one in the foreground.
Several document image compression schemes for color images are based on "adding color" to a binary image. Such a compression strategy is unable to handle documents with back-to-front interference [5]. Optical Character Recognizers (OCRs) are also unable to work properly on such documents. Several algorithms were developed specifically to binarize documents with back-to-front interference [3,4,6,7,8,9]. No binarization technique is an all-case winner, as many parameters may interfere with the quality of the resulting image [9]. The development of new binarization algorithms is still an important research topic. International competitions on binarization algorithms, such as DIBCO (Document Image Binarization Competition) [10], are evidence of the relevance of this area.
This paper presents a new global filter [1] to binarize documents, which is able to remove the back-to-front noise in a wide range of documents. Quantitative and qualitative assessments made on a wide variety of documents from three different "real-world" datasets (typed, printed and handwritten, using different kinds of paper, ink, etc.) attest to the efficiency of the proposed scheme.

The New Algorithm
The algorithm proposed here is performed in four steps: (1) decision-making for finding the vector of parameters of the image to be filtered; (2) filtering the image using a bilateral filter; (3) splitting the image into its RGB components and binarizing each RGB channel using a method inspired by Otsu's algorithm; and (4) choosing which of the RGB components best preserved the document information in the foreground, which is taken as the final output of the algorithm. Figure 2 presents the block diagram of the proposed algorithm. The functionality of each block is detailed as follows.

The Decision Making Block
The decision-making block takes as input the image to be binarized and outputs a vector with four parameters: the value of the kernel (kernel) for the bilateral filter and three threshold values ($t_R$, $t_G$, $t_B$) that will later be used in the modified Otsu filtering.
The training of the binarization process proposed here is made with synthetic images, which were generated as explained in Section 2.2. After filtering, the matrix of co-occurrence probabilities between the original image and the binary image was calculated for each of the images in the document training set, whose generation is explained below.
The probabilistic structure applied in the analysis of each of the images in the training set is similar to the transmission of binary data in a Binary Asymmetric Channel, as shown in Figure 3. In information theory, the probabilities P(f/b) and P(b/f) represent additive noise in a communication channel; here they represent the inability of the algorithm to correct the back-to-front interference of the image tested in the binarization process. The probabilities P(b/b) and P(f/f) are calculated from the pixel-to-pixel comparison of the binarized image generated by the proposed algorithm with the ground-truth image.
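The pixel-to-pixel comparison described above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' Java code: the function name `cooccurrence_probabilities` and the convention that `True` marks a foreground (ink) pixel are assumptions made for the example.

```python
import numpy as np

def cooccurrence_probabilities(binary, ground_truth):
    """Return (P(b/b), P(f/f)) for two boolean images of equal shape.

    Assumed convention: True marks a foreground (ink) pixel and
    False a background (paper) pixel.
    """
    binary = np.asarray(binary, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    if binary.shape != ground_truth.shape:
        raise ValueError("images must have the same shape")
    fg = ground_truth
    bg = ~ground_truth
    n_fg, n_bg = fg.sum(), bg.sum()
    if n_fg == 0 or n_bg == 0:
        raise ValueError("ground truth needs both foreground and background pixels")
    # Fraction of ground-truth foreground pixels kept as foreground.
    p_ff = (binary & fg).sum() / n_fg
    # Fraction of ground-truth background pixels kept as background.
    p_bb = (~binary & bg).sum() / n_bg
    return float(p_bb), float(p_ff)

# Tiny hand-made example: 2 foreground and 6 background pixels in the truth.
truth = np.array([[1, 1, 0, 0],
                  [0, 0, 0, 0]], dtype=bool)
result = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0]], dtype=bool)
p_bb, p_ff = cooccurrence_probabilities(result, truth)
```

The two remaining entries of the co-occurrence matrix follow as P(f/b) = 1 − P(b/b) and P(b/f) = 1 − P(f/f).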

The background-background probability is the function to be optimized in the decision-making block; it measures how well background pixels (paper) of the original image are mapped onto white pixels of the binary image. For each RGB channel, it depends on all the parameters of the original image: texture, strength of the back-to-front interference (simulated by the coefficient α), paper translucidity, etc. Thus, one can represent this dependence as:
$$P(b/b) = f(t_c, \alpha, \text{texture}, \text{translucidity}, \ldots)$$
The optimal threshold $t_c^*$ for each channel, where the index c can be R, G or B, is calculated in the decision-making block by maximizing P(b/b):
$$t_c^* = \arg\max_{t_c} P(b/b)$$
subject to a given criterion P(f/f) ≥ M. The criterion used here was M = 97%, that is, at most 3% of the foreground pixels may be incorrectly mapped. During the training phase, the best $t_c^*$ is chosen among the three channels as the one that maximizes P(b/b) for each of the images in the training set.
The matrix of co-occurrence probabilities is calculated and the decision maker chooses the best binary image. The decision-making block was trained with 32,000 synthetic images so that, given a real image to be binarized, it finds the optimal threshold parameters.
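For a single training image, the constrained maximization can be illustrated by an exhaustive scan over the 256 possible thresholds of one channel. This is a minimal sketch under stated assumptions: the function name `best_threshold` is hypothetical, pixels with intensity less than or equal to the threshold are taken as foreground, and the paper's trained decision-making block (which predicts thresholds for unseen images) is not reproduced here.

```python
import numpy as np

def best_threshold(channel, ground_truth, min_p_ff=0.97):
    """Scan thresholds 0..255 on one colour channel.

    Returns (t, P(b/b), P(f/f)) for the threshold that maximizes P(b/b)
    subject to P(f/f) >= min_p_ff, or None if no threshold qualifies.
    Assumption: pixels with value <= t are classified as foreground (ink).
    """
    channel = np.asarray(channel)
    fg = np.asarray(ground_truth, dtype=bool)
    bg = ~fg
    n_fg, n_bg = fg.sum(), bg.sum()
    if n_fg == 0 or n_bg == 0:
        raise ValueError("ground truth needs both foreground and background pixels")
    best = None
    for t in range(256):
        ink = channel <= t
        p_ff = (ink & fg).sum() / n_fg
        p_bb = (~ink & bg).sum() / n_bg
        # Keep the first threshold that strictly improves P(b/b).
        if p_ff >= min_p_ff and (best is None or p_bb > best[1]):
            best = (t, float(p_bb), float(p_ff))
    return best

# Toy channel: ink near 20-30, show-through at 150, paper at 220.
channel = np.array([[20, 30, 150, 220],
                    [220, 150, 220, 25]], dtype=np.uint8)
truth = channel < 100          # only the dark strokes are true foreground
best = best_threshold(channel, truth)
```

Running the same scan on the R, G and B channels and keeping the channel with the highest P(b/b) mirrors the selection described for the training phase.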

Generating Synthetic Images
The Decision-Making Block needs training to "learn" the optimal threshold parameters and the value of the kernel to be used in the bilateral filter. Such training must be done using controlled images, which are synthesized to mimic different degrees of back-to-front interference, paper aging, paper translucidity, etc. Figure 4 presents the block diagram for the generation of synthetic images. Two binary images of documents of different nature (typed, handwritten with different pens, printed, etc.) are taken: F (front) and V (verso, or back). The front image is blurred with a weak Gaussian filter to simulate the digitalization noise [1], that is, the hues that appear after document scanning. The verso image is "blurred" by passing it through two different Gaussian filters that simulate the low-pass effect of the translucidity of the verso as seen from the front of the paper. Two different parameters were used to simulate two different classes of paper translucidity. The "blurred" verso image is then faded with a coefficient α varying between 0 and 1 in steps of 0.01. Next, a circular shift of the lines of the document by either 5 or 10 pixels is made, to minimize the chances that the front and verso lines coincide entirely. Finally, the two images are overlapped by performing a pixel-by-pixel "darker" operation.
Paper texture is added to the image to simulate the effect of document aging. The texture patterns were extracted from documents dating from the late 19th century to the year 2000. The analysis of 3450 documents representative of the wide variety of documents of that period yielded 100 different clusters of textures. The synthetic texture to be applied to the image to simulate paper aging is generated from those 100 clusters by image quilting [11] and randomly, as explained in reference [9]. The training performed in the current version of the presented algorithm was made with 16 of those 200 synthetic textures. The total number of images used for training here was thus 16 (textures), times 10 (0 < α < 1 in steps of 0.10), times 2 blur parameters for the Gaussian filters, times 100 different binary images, totaling 32,000 images. Details of the full generation process of the synthetic image database are out of the scope of this paper and may be found in reference [9].
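The blur, fade, shift and "darker" steps can be sketched as follows. This is a hedged illustration rather than the generator of reference [9]: the function names, the blur widths `sigma_front` and `sigma_verso`, and the fading rule (pulling the verso toward white, 255) are all assumptions, and the paper-texture step is omitted.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge replication (NumPy only)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(img, dtype=float), radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def synthesize(front, verso, alpha, sigma_front=0.5, sigma_verso=2.0, shift=5):
    """Overlay a faded, blurred verso on a slightly blurred front (0 = ink, 255 = paper)."""
    f = gaussian_blur(front, sigma_front)      # weak blur: digitalization noise
    v = gaussian_blur(verso, sigma_verso)      # stronger blur: paper translucidity
    v = 255.0 - alpha * (255.0 - v)            # assumed fading rule: alpha = 0 removes the verso
    v = np.roll(v, shift, axis=0)              # circular shift of the lines
    return np.minimum(f, v)                    # pixel-by-pixel "darker" operation

# Two toy pages, each with a single horizontal stroke.
front = np.full((40, 40), 255.0)
front[10:12, 5:35] = 0.0
verso = np.full((40, 40), 255.0)
verso[25:27, 5:35] = 0.0
sample = synthesize(front, verso, alpha=0.5)

# Size of the training set quoted in the text.
n_training = 16 * 10 * 2 * 100   # textures x alpha values x blur classes x binary images
```

In the sample, the front stroke stays dark while the verso stroke reappears five rows lower as a lighter gray band, which is the kind of interference the training images are meant to mimic.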

The Bilateral Filter
The bilateral filter was first introduced by Aurich and Weule [12] under the name "nonlinear Gaussian filter". It was later rediscovered by Tomasi and Manduchi [13], who called it the "bilateral filter", which is now the most commonly used name according to reference [14].
The bilateral filter is a technique to smooth images while preserving their edges. The filter output at each pixel is a weighted average of its neighbors. The weight assigned to each neighbor decreases with both the distance between pixels in the image plane (the spatial domain S) and the distance on the intensity axis (the range domain R). It combines two Gaussian filters: one works in the spatial domain, while the other works in the intensity domain. Therefore, not only the spatial distance but also the intensity distance is important for the determination of the weights. The two stages of filtering correspond to the geometric closeness (i.e., filter domain) and the photometric similarity (i.e., filter range) among the pixels in a window of size N × N.
Let I(x,y) be a 2D discrete image of size N × N, such that {x,y} ∈ {0, 1, ..., N − 1} × {0, 1, ..., N − 1}. Assume that I(x,y) is corrupted by additive white Gaussian noise of variance $\sigma_n^2$. For a pixel (x,y), the output of the bilateral filter can be described by Equation (1):
$$I_{BF}(x,y) = \frac{1}{K} \sum_{i=x-d}^{x+d} \sum_{j=y-d}^{y+d} G_s(i,j;x,y)\, G_r(I(i,j);I(x,y))\, I(i,j) \qquad (1)$$
where I(x,y) is the pixel intensity in the image before applying the bilateral filter, $I_{BF}(x,y)$ is the resulting pixel intensity after applying the bilateral filter and d is a non-negative integer such that (2d + 1) × (2d + 1) is the size of the neighborhood window. Let $G_s$ and $G_r$ be the domain and the range components, respectively, which are defined as:
$$G_s(i,j;x,y) = e^{-\frac{(i-x)^2 + (j-y)^2}{2\sigma_s^2}} \qquad (4)$$
and
$$G_r(I(i,j);I(x,y)) = e^{-\frac{|I(i,j)-I(x,y)|^2}{2\sigma_r^2}} \qquad (5)$$
The normalization constant K is given as:
$$K = \sum_{i=x-d}^{x+d} \sum_{j=y-d}^{y+d} G_s(i,j;x,y)\, G_r(I(i,j);I(x,y))$$
Equations (4) and (5) show that the bilateral filter has three parameters: $\sigma_s^2$ (the filter domain), $\sigma_r^2$ (the filter range), and the window size N × N [15]. The geometric spread of the bilateral filter is controlled by $\sigma_s^2$. If the value of $\sigma_s^2$ is increased, more neighbours are combined in the diffusion process, yielding a "smoother" image, while $\sigma_r^2$ represents the photometric spreading. Only pixels with a percentage difference of less than $\sigma_r^2$ are processed [13].
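A direct, unoptimized transcription of the weighted average into NumPy may help make the roles of d, the spatial spread and the range spread concrete. It is a sketch for a single-channel image; the function name and the edge-replication border handling are assumptions, since the text does not say how image borders are treated.

```python
import numpy as np

def bilateral_filter(image, d, sigma_s, sigma_r):
    """Brute-force bilateral filter over a (2d+1) x (2d+1) window."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    padded = np.pad(img, d, mode="edge")   # assumed border handling
    acc = np.zeros_like(img)               # weighted sum of neighbour intensities
    norm = np.zeros_like(img)              # normalization constant K per pixel
    for di in range(-d, d + 1):
        for dj in range(-d, d + 1):
            neighbour = padded[d + di:d + di + h, d + dj:d + dj + w]
            g_s = np.exp(-(di * di + dj * dj) / (2.0 * sigma_s ** 2))        # domain weight
            g_r = np.exp(-((neighbour - img) ** 2) / (2.0 * sigma_r ** 2))   # range weight
            weight = g_s * g_r
            acc += weight * neighbour
            norm += weight
    return acc / norm

# Noisy step edge: dark half at 50, bright half at 200.
rng = np.random.default_rng(0)
noisy = np.full((16, 16), 50.0)
noisy[:, 8:] = 200.0
noisy += rng.normal(0.0, 2.0, noisy.shape)
smoothed = bilateral_filter(noisy, d=2, sigma_s=2.0, sigma_r=10.0)
```

Because the range weight across the step is practically zero, each side is averaged only with itself: the noise is reduced while the edge stays sharp.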

Otsu Filtering
After passing through the bilateral filter, the image is split into its original (non-gamma-corrected) Red, Green and Blue components, as shown in the block diagram in Figure 2. The kernel of the bilateral filter alters the balance of the colors in the original image in such a way as to widen the differences between the color of the front and that of the back-to-front interference. A modified version of Otsu's algorithm [16] is applied to each RGB channel using the thresholds determined by the Decision-Making Block, which may be considered the "optimal" threshold for each RGB channel, and three binary images are then generated.
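The text does not detail how Otsu's algorithm is modified, so the sketch below shows the classic method (maximizing the between-class variance of the histogram) as a baseline, together with a per-channel binarization that accepts externally supplied thresholds such as those from the decision-making block. The function names and the "value <= threshold is ink" convention are assumptions.

```python
import numpy as np

def otsu_threshold(channel):
    """Classic Otsu threshold for an 8-bit channel."""
    hist = np.bincount(np.asarray(channel, dtype=np.uint8).ravel(), minlength=256)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # probability of the class "value <= t"
    mu = np.cumsum(prob * np.arange(256))      # cumulative first moment up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom   # between-class variance
    sigma_b[denom == 0] = 0.0                  # thresholds that leave one class empty
    return int(np.argmax(sigma_b))

def binarize_rgb(image, thresholds=None):
    """Return one boolean ink mask per RGB channel (True = foreground)."""
    masks = []
    for c in range(3):
        channel = image[..., c]
        t = otsu_threshold(channel) if thresholds is None else thresholds[c]
        masks.append(channel <= t)
    return masks

# Toy page: yellowish paper with one dark stroke.
page = np.empty((8, 8, 3), dtype=np.uint8)
page[...] = (200, 190, 150)
page[3:5, 1:7] = (30, 30, 30)
masks = binarize_rgb(page)
masks_fixed = binarize_rgb(page, thresholds=(100, 100, 100))
```

Passing `thresholds` bypasses the histogram analysis entirely, which is one simple way to plug in values chosen by a separate decision stage.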

Image Classification
The image classification block was also trained with the synthetic images, so that it analyzes the three binary images generated from the three channels and outputs the one considered the best. This decision is made by a naïve Bayes automatic classifier, which was trained using the co-occurrence matrix calculated for each of the 32,000 synthetic images by comparing each of them with the original ground-truth image, the Front image.

Experiments and Results
As already explained, the enormous variety of kinds of text documents makes it extremely improbable that one single algorithm is able to satisfactorily binarize all kinds of documents. Depending on the nature (or degree of complexity) of the image, several algorithms or none will be able to provide good results. This paper follows the assessment methodology proposed in reference [9], in which one compares the numbers of background and foreground pixels correctly matched with a ground-truth image. Twenty-three binarization algorithms were tested using the methodology described:
2. DaSilva-Lins-Rocha [6]
[...]
9. MinError [21]
10. Mixture-Modeling [22]
11. Moments [23]
12. IsoData [24]
13. Percentile [25]
14. Pun [26]
15. Shanbhag [27]
16. Triangle [28]
17. Wu-Lu [29]
18. Yean-Chang-Chang [30]
19. Intermodes [31]
20. Minimum (variation of [31])
21. Ergina-Local [32]
22. Sauvola [33]
23. Niblack [34]
A ground-truth image for each "real-world" image is needed to allow a quantitative assessment of the quality of the final binary image. Only the DIBCO dataset [10] had ground-truth images available. This makes the assessment of real-world images extremely difficult [35]. All care must be taken to guarantee the fairness of the process. The ground-truth images for the other datasets were generated by applying the 23 algorithms above and the bilateral algorithm to all the test images in the Nabuco [7] and LiveMemory [36] datasets. Visual inspection was made to choose the best binary image in a blind process, that is, a process in which the people who selected the best image did not know which algorithm had generated it. To increase the degree of fairness and the number of filtering possibilities, the three component images produced by the Decision-Making Block were all analyzed. The binary images chosen using the methodology above went through salt-and-pepper filtering and were used as ground-truth images for the assessment below. All the processing time figures presented in this paper were obtained on an Intel i7-4510U @ 2.00 GHz × 2 with 8 GB RAM, running Linux Mint 18.2 64-bit. All algorithms were coded in Java, possibly by their authors.

The Nabuco Dataset
The Nabuco bequest encompasses about 6500 letters and postcards written and typed by Joaquim Nabuco [7], totaling about 30,000 pages. Such documents are of great interest to whoever studies the history of the Americas, as Nabuco was one of the key figures in the freedom of black slaves and was the first Brazilian Ambassador to the U.S.A. The documents of Nabuco were digitized by the second author of this paper and the historians of the Joaquim Nabuco Foundation using a flatbed scanner at 200 dpi resolution in true color (24 bits per pixel), from 1992 to 1994. Due to the serious storage limitations of the time, images were saved in the JPEG format with 1% loss. The historians in the project concluded that a 150 dpi resolution would suffice to represent all the graphical elements in the documents, but the 200 dpi resolution was chosen to be compatible with the FAX devices widely used then. About 200 of the documents in the Nabuco bequest exhibited back-to-front interference. The 15 document images used in this dataset were chosen for being representative of the diversity of documents in such a universe.
Table 1 presents the quantitative results obtained for all the documents in this dataset. P(f/f) stands for the ratio between the number of foreground pixels in the original image mapped onto black pixels and the number of black pixels in the ground-truth image. Similarly, P(b/b) is the ratio between the number of background pixels in the original image mapped onto white pixels of the binary image and the number of white pixels in the ground-truth image. The figures for P(b/b) and P(f/f) are followed by "±" and the value of the standard deviation. The time corresponds to the mean processing time taken by the algorithm to process the images in this dataset. The results are ranked in decreasing order of P(b/b).
The results presented in Table 1 show the bilateral filter in third place for this dataset in terms of image quality; however, its standard deviation is much lower than that of the first two. This implies that its quality is more stable across the various document images in this dataset. Figure 5 presents the documents for which the bilateral filter produced the best and the worst results in terms of image quality, with two zoomed areas from the original and the binarized document.
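The way the per-image figures are condensed into "mean ± standard deviation" rows and ranked by P(b/b) can be sketched as below. The algorithm names and numbers are invented purely for illustration and are not results from Table 1; the use of the population standard deviation is also an assumption, as the text does not say which estimator was used.

```python
import statistics

def summarize(scores):
    """Build table rows (name, mean P(b/b), std, mean P(f/f), std), best P(b/b) first.

    scores maps an algorithm name to a list of per-image (P(b/b), P(f/f)) pairs.
    """
    rows = []
    for name, pairs in scores.items():
        bb = [p[0] for p in pairs]
        ff = [p[1] for p in pairs]
        rows.append((name,
                     statistics.mean(bb), statistics.pstdev(bb),
                     statistics.mean(ff), statistics.pstdev(ff)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical per-image percentages for two made-up algorithms.
scores = {
    "algorithm-A": [(90.0, 98.0), (94.0, 99.0)],
    "algorithm-B": [(95.0, 97.0), (97.0, 99.0)],
}
table = summarize(scores)
for name, bb_mean, bb_std, ff_mean, ff_std in table:
    print(f"{name}: P(b/b) = {bb_mean:.2f} ± {bb_std:.2f}, P(f/f) = {ff_mean:.2f} ± {ff_std:.2f}")
```

A low standard deviation alongside a slightly lower mean is what the text reads as more stable quality across the images of a dataset.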

The LiveMemory Dataset
This dataset encompasses 15 documents with 200 dpi resolution selected from the over 8000 documents of the LiveMemory project, which created a digital library with all the proceedings of technical events of the Brazilian Telecommunications Society. The original proceedings were offset printed from documents either typed or electronically produced. Table 2 presents the performance results for the 12 best-ranked algorithms. The bilateral filter obtained the best results in terms of image filtering. It is worth observing that in the case of the worst-quality image (Figure 6, right) the performance degraded for all the algorithms. This behavior is due to the shaded area in the hard-bound spine of the volumes of the proceedings.

The DIBCO Dataset
This dataset has all the 86 images from the Document Image Binarization Competition (DIBCO) from 2009 to 2016. Table 3 presents the results obtained. The performance of the bilateral filter on this set may be considered good in general. The overall performance of the bilateral filter was strongly degraded by the single image shown in Figure 7 (right), for which a P(f/f) of 25.93 drastically lowered the average result of the algorithm on this test set. It is important to remark that such an image is almost unreadable even for humans and that it degraded the performance of all the best algorithms.

Conclusions
Historical documents are far more difficult to binarize as several factors such as paper texture, aging, thickness, translucidity, permeability, the kind of ink, its fluidity, color, aging, etc. all may

Figure 1. Images with back-to-front interference from the three test sets used in this paper: Nabuco bequest (left), LiveMemory (center) and DIBCO (right).

Figure 2. Block diagram of the proposed algorithm.

Figure 3. Generation of the co-occurrence matrix for each of the images in the training set.

Figure 4. Block diagram of the scheme for the generation of synthetic images for the Decision-Making Block.

Table 1. Binarization results for images from Nabuco bequest.

Table 2. Binarization results for images from the LiveMemory project.