SPARSE REPRESENTATION BASED INPAINTING FOR THE RESTORATION OF DOCUMENT IMAGES AFFECTED BY BLEED-THROUGH

Bleed-through is a commonly encountered degradation in ancient printed documents and manuscripts, which severely impair their readability. Digital image restoration techniques can be effective to remove or significantly reduce this degradation. In bleed-through document image restoration the main issue is to identify the bleed-through pixels and replace them with appropriate values, in accordance to their surroundings. In this paper, we propose a two stage method, where a pair of properly registered images of the document recto and verso is first used to locate the bleed-through pixels in each side, and then a sparse representation based image inpainting technique is used to fill-in the bleed-through areas according to the neighbourhood, in such a way to preserve the original appearance of the document. The advantages of the proposed inpainting technique over state-of-the-art methods are illustrated by the improvement in the visual results.


INTRODUCTION
Ancient documents are valuable sources of history, culture, and heritage information.Nevertheless, when the ancient manuscripts are written on both pages (recto and verso) of the sheet, often the ink penetrates through the page due to aging, humidity, ink chemical property or paper porosity, and becomes visible as degradation on the other side.This type of degradation is termed as bleed-through, and impairs legibility and interpretation of the document contents.In the past, physical restoration techniques were applied to remove the bleed-through degradation, but unfortunately they are costly, invasive, and can cause permanent damage to the document.Nowadays, ancient documents are usually archived in the form of digital images for preservation, distribution, retrieval, or analysis.Thus, digital image processing techniques have become a standard for their ability to perform any number This work was supported by the ERCIM Fellowship Program 2017.
of alterations to the document appearance, while leaving the original intact, and have been used with impressive results to remove bleed-through.Indeed, besides being necessary to improve readability, the removal of such degradation is a critical preprocessing step in many tasks such as optical character recognition (OCR), feature extraction, segmentation and automatic transcription.Generally, bleed-through restoration is handled as a classification problem, where image pixels are labeled as either background, bleed-through, or foreground (original text) [1].Broadly speaking, the approaches to bleed-through restoration can be divided into two categories: blind or single-sided and non-blind or double-sided.In blind methods only one side of the document is used, whereas non-blind methods exploit accurately pre-registered recto and verso images of the document.Most of the earlier blind methods involve an intensity based thresholding step [2].However, thresholding is not suitable when the aim is to preserve the original appearance of the document.In order to remove bleed-through selectively, in [3] an independent component analysis (ICA) based method is applied to an RGB image to separate the foreground, background, and bleed-through layers.A dual-layer Markov random field (MRF) is suggested in [4], whereas in [5] a conditional random field (CRF) based method is proposed.A non-blind ICA based method is outlined in [6].Other methods of this category are proposed in [7][8] [9].A common drawback of all the methods above is the search for an optimal way to replace the bleed-through pattern to be removed.In some of those papers, as a preliminary step, a "clean" background for the entire image is estimated (e.g.[4][9]), but this is usually a very laborious task.In this paper, we propose a two-step method to address bleedthrough document restoration from pre-registered recto and verso images.According to the point of view of document appearance preservation, the two steps consist in the selective identification of the bleed-through pattern, followed by its inpainting with the texture of the close background.Bleedthrough identification can be performed using, in principle, any of the methods proposed in the literature for this purpose.
Here we adopt the bleed-through cancellation algorithm de-scribed in [10], which is simple and very fast.Although efficient in locating bleed-through, also this algorithm suffers from the lack of a proper strategy to replace the unwanted pixels.Indeed, these are assigned with the predominant graylevel of the background.This causes imprints of the bleed-through pattern that are visible in the restored image.An interpolation based inpainting technique for such imprints is presented in [11], but the filled-in areas are mostly smooth.Here, in the second step of our method, we use a sparse image representation based inpainting to find a befitting fill-in for the bleed-through imprints, according to their neighborhood.This inpainting step, which constitutes the main contribution of the paper, enhances the quality of the bleed-through restored image and preserves the natural paper texture.The rest of this paper is organized as follows.Next section briefly introduces sparse image representation and dictionary learning.The non-blind bleed-through identification is presented in Section 3. Section 4 describes the proposed inpainting technique and the comparative results are discussed in Section 5. Section 6 presents some future prospects.

SPARSE DICTIONARY LEARNING
Sparse representation based modeling emerges as a powerful tool in the image processing field [12], with applications in compression, restoration, inpainting, clustering, recognition, and more.The underlying assumption in this direction is that images contain highly redundant information, therefore they can be approximated by sparse representations of prototype signal-atoms such that very few samples are sufficient to reconstruct a whole image [13].The objective in this case is to find a basis, called dictionary, which can be used to represent the given image data by a sparse combination of a few elements of the learned dictionary.Formally, for a given set of signals Y = [y 1 , y 2 , ..., y N ] where y i ∈ R n , the sparse representation consists in learning a suitable dictionary, D ∈ R n×K ( with n < K << N ), and the sparse vectors with minimal number of non-zero coefficients, X ∈ R K×N , such that y i ≈ Dx i , where x i ∈ X.In this setting, the sparse decomposition can be formulated as an optimization problem of the following form: where z is the desired sparsity level in the coefficients X and • p is the l p norm, with 0 ≤ p ≤ 1.An iterative-alternate approach is a natural way to find a solution, by minimizing with respect to one set of variables while keeping the other fixed until some convergence condition is fullfilled [14][13].The sparse vectors indicate the dictionary atoms that are used to generate each signal.Finding the exact sparse representation for signals, given a dictionary, is a NP-hard problem [15], however using approximate solutions has proven to be a good choice.Commonly used sparse approximation algorithms are Matching Pursuit (MP) [16], Basis Pursuits (BP) [17], Focal Undetermined System Solver (FOCUSS) [18] and Orthogonal Matching Pursuit (OMP) [19].The dictionary that can lead to sparse decomposition can either be selected from a set of pre-defined transforms like Wavelets, Fourier, Cosine etc., or can be adaptively learned from the set of training signals [13] [20].Although working with pre-defined dictionaries may be simple and fast, their performance might be not good for every case due to their global-adaptivity nature.Instead, learned dictionaries are adaptive to both the signals and the processing task at hand, thus resulting in far better performances [21] [22].For a given set of signals Y and dictionary D, the dictionary learning algorithms generate a representation of signal y i as a sparse linear combination of the atoms Dictionary learning algorithms distinguish themselves from traditional model-based methods by the fact that, in addition to x i , they also train the dictionary D to better fit the data set Y. The solution is generated by iteratively alternating between the sparse coding stage for i = 1, ..., N and the dictionary update stage for the X obtained from the sparse coding stage Dictionary learning algorithms are often sensitive to the choice of z.The update step can either be sequential as in [14] or parallel as in [23].In sequential dictionary learning, the dictionary update minimization problem (4) is split into K sequential minimizations, optimizing the cost function (4) for each individual atom while keeping fixed the remaining atoms.In the method proposed in [14], which has become a benchmark in dictionary learning, each column d k of D and its corresponding row of coefficients x row k are updated based on a rank-1 matrix approximation of the error for all the signals when d k is removed where . The singular value decomposition (SVD) of E k = U∆V is used to find the closest rank-1 matrix approximation of E k .The d k update is taken as the first column of U and the x row k update is taken as the first column of V multiplied by the first element of ∆.To avoid the loss of sparsity in x row k that will be created by the direct application of the SVD on E k , in [14] it was proposed to modify only the non-zero entries of x row k resulting from the sparse coding stage.This is achieved by taking into account only the signals y i that use the atom d k in (5) or by taking the SVD of E R k = E k I w k , where w k = {i|1 ≤ i ≤ N ; x row k (i) = 0} and I w k is the N × |w k | submatrix of the N × N identity matrix obtained by retaining only those columns whose index numbers are in w k , instead of the SVD of E k .

BLEED-THROUGH IDENTIFICATION
Provided that the two sides of the manuscript have been written with the same ink and at two close moments, in the majority of the cases the bleed-through text is almost everywhere lighter than the foreground text.Nonetheless, the intensity of bleed-through is usually very variable, and this makes hard to find suitable thresholds to selectively identify and remove it.Indeed, once again we stress the fact that thresholding would likely destroy also the background texture and other slight document marks that genuinely belong to the side at hand.Thus, in [10] we propose to selectively identify bleed-through and discriminate it from other desired features by looking for those pixels that are lighter than the corresponding pixels in the opposite side.Three main exceptions to this basic rule are however accounted for.The first one concerns the occlusions, i.e. those areas where the two texts overlap, and that should not be identified as bleed-through in none of the two sides.In these pixels the intensity (graylevel) is the lowest and almost equal in the two corresponding pixels of the recto and verso sides.Therefore, at present we identify the occlusion areas with the aid of suitable thresholds.A specular situation occurs for pixels of background in both sides.Here the intensity is the highest and almost equal in the two sides.Again, proper thresholds can help to discriminate this situation.A third case generates from the typical phenomenon of diffusion of the ink that penetrates through the paper.It may thus happen that, especially at the border of the bleed-through characters, the pixel intensity is lower than that of the (background) pixel in the opposite side, since the text causing the degradation is sharper.To cope with this effect, rather than simply compare the two recto and verso intensities, we compare the intensity in one side with the "smeared" version of the intensity in the opposite side.Therefore, at each pixel except those identified as occlusions or background, we apply the following formulas to compute the two "seeping levels" q r and q v : where the subscripts r, v indicate recto and verso, respectively, s(i, j) is the intensity at pixel (i, j), b is an estimate of the predominant graylevel in the background, and the small constant is included to avoid indeterminacies or infinity.These formulas comprise two Point Spread Functions (PSF) of unit volume, h r and h v , which describe the smearing of the seeping ink, thus allowing a pattern in a side to better match the corresponding one in the opposite side.We assume h r and h v as two Gaussian functions, with size and standard deviation approximately estimated from the extent of the character smearing.At each pixel, we retain the smallest between the two "seeping levels" computed with eq.6, to indicate a bleed-through pixel in the related side, and set to zero the other, to indicate a foreground pixel in the opposite side.
In [10], the strategy described above is applied, with similar performance, using the optical density rather than the intensity, where the optical density is interpreted as a sort of "quantity" of ink, and with reference to a linear additive, nonstationary model to describe the two observed recto and verso densities.The restoration algorithm is completed by substituting the pixels having positive "seeping level" with the background predominant graylevel b, and leaving unchanged the areas where q = 0.The experimental results proved the efficiency of the algorithm in removing bleed-through while leaving unaltered other salient marks, such as stamps, pencil annotations, paper textures, etc.These marks can be of paramount importance for the scholar studying the document contents and its origin.

BLEED-THROUGH IMPRINT INPAINTING
To each side separately, a dictionary based inpainting method is applied to find a suitable fill-in to replace the identified bleed-through pixels.Our aim is to preserve the background texture to ensure the original look of the document.A patch based sparse representation model is used for image inpainting, with a learned dictionary.Mathematically, the image inpainting problem is to reconstruct the underlying complete image (lexicographically ordered) C ∈ R N from its observed incomplete version I ∈ R L , where L < N .We assume a sparse representation of C over a dictionary D ∈ R n×K : C ≈ DX.The incomplete image I and the complete image C are related through I = MC, where M ∈ R L×N represents the layout of the missing pixels.The image inpainting problem can be formulated as: Assuming that a well trained dictionary D is available, the problem boils down to the estimation of sparse coefficients X such that the underlying complete image Ĉ is given by Ĉ = D X.To learn the dictionary D a training set Y is created by extracting overlapping patches of size √ p s × √ p s from the image at location j = 1, 2, ., P , where P is the total number of patches.Then we have y j = R j C, where R j (.) is an operator that extracts the patch y j from the image y , and its transpose, denoted by R T j (.), is able to put back a patch into the j th position in the reconstructed image.Considering that patches are overlapped, the recovery of C from {y j } can be obtained by averaging all the overlapped patches.We learned a dictionary D ∈ R ps×K from the training set Y, using the method described in Section 2. For optimization we used only complete patches from Y, i.e. the patches with no bleed-through imprint pixels.This choice speeds up the training process and excludes the 'non-informative' imprint pixels.For each patch y j to be inpainted, we first search for its mutual similar patches in a W × W bounded neighborhood window.Similar to [24] we used block matching technique with Euclidean distance metric as similarity criterion.The similar patches are grouped together in a matrix Ω ∈ R ps×c , where c is the total number of similar patches.Using the learned dictionary D and Ω, a grouped-based sparse representation method similar to [25] is used to find the befitting fill-in for the bleed-through imprints.The use of the similar patch group Ω incorporates the local information in the sparse estimation of the patch and preserves the natural look of the document.

EXPERIMENTAL RESULTS
We compared the proposed method with other state-of-the-art methods for images of the database of 25 manuscript rectoverso pairs, presented in [26] [27].The input image with bleed-through is first processed for bleed-through detection and removal as discussed in Section 3.For imprints inpainting, the training data set Y is constructed by selecting all the overlapping patches of size 8 × 8 from the input image.We learned a dictionary D of size 64 × 256 from Y, with the sparsity level z = 3 and overcomplete DCT as initial dictionary.For each patch to be inpainted, the sparse coefficients are estimated under the learned dictionaries and the similar patches in its 25 × 25 neighbourhood window, as discussed in Section 4. The estimated sparse coefficient vector of each patch is denoted by x j , where j indicates the number of the patch.The reconstructed patch is then obtained as ŷj = Dx j .In bleed-through restoration the efficacy is generally compared qualitatively, as in most cases the original clean image is not available.A visual comparision of the proposed method with other state-of-the-art methods is presented in Fig. 1.As can be seen, the proposed method (Fig. 1 (e)) produces a better result compared with the other methods, especially in the area bounded by the red rectangle.The dual-layer MRF approach with user trained likelihood of [7] (Fig. 1 (b)) flattens the background and removes the original text in the overlapping cases.The wavelet based method in [8] (Fig. 1 (c)) produces a binary image and fails to handle darker bleed-through.The non-parametric method of [9] (Fig. 1 (d)) produces better results but again fails in some dark bleed-through regions.
The proposed method copes very well with bleed-through removal and the dictionary based inpainting preserves the original look of the document.

CONCLUSIONS
We plan to extend the sparse representation based inpainting method to manage the cases where some occlusions are wrongly identified as bleed-through, and where some bleedthrough strokes are as dark as the foreground text.As per the first case, a proper dictionary of foreground patches must be additionally learned.In both cases, a finer criterion, e.g. based on a measure of proximity, needs to be found to discriminate dark strokes that belong to the bleed-through text from those that belong to the foreground text.