1. Introduction
Historical and archival manuscripts often suffer damage from a variety of factors, primarily due to the natural degradation of materials over time and the conditions in which they are conserved. The fundamental requirement for accessing these manuscripts is the removal of degradation to ensure the main text is fully understandable.
Binarization of degraded documents can effectively separate the primary text from complex background patterns that need to be removed entirely [1, 2, 3, 4, 5]. However, this process may not be sufficient for a complete appreciation of the manuscript. First, binarization produces a two-class, black-and-white image, inevitably losing important details that the manuscript may contain, such as annotations, miniatures, watermarks, and drawings, which should be preserved for their historical and informational value. Second, binarization struggles with severe degradation, particularly when the interference pattern is nearly as intense as the primary text. Thus, it is crucial to strike a proper balance between removing unnecessary and harmful elements while preserving the primary text and enhancing features that, although they are not directly related to the primary text, are significant to the manuscript’s history.
Ink bleed-through is one of the most common and damaging forms of degradation found in ancient manuscripts. This occurs when both sides of the paper are written on, resulting in the two texts appearing intertwined on both sides, albeit with varying intensities. Such degradation can be attributed to poor conservation conditions, water infiltration, or the natural composition of the materials. Methods designed to reduce bleed-through are categorized into blind methods, which utilize information from one side only [6, 7], and non-blind methods, which take advantage of the two sides of the manuscript page often being available [8, 9, 10].
Non-blind methods can achieve highly effective virtual restoration, but they require precise alignment of the two images as a trade-off [11, 12, 13].
In a previous study (see [14]), we proposed a simple multilayer shallow neural network with backpropagation training [15] for addressing non-blind cases. We designed the neural network to auto-adapt to the manuscript being enhanced, meaning it does not require prior training on a large set of similar manuscripts that have already been classified. Among various potential solutions, we utilized a data model to generate simulated degradation samples for the training phase. This data model provided an approximate description of the degradation when recto–verso digitizations of the manuscripts are available [10]. A training set was created using ground truths from the clean areas of the manuscript and subsequently mixed according to the model. The experimental results presented in [14] on heavily damaged manuscripts are encouraging regarding degradation removal.
We accounted for variable degradation, including very severe cases. This enables our neural network, developed from a single exemplar manuscript, to potentially perform effectively on other manuscripts within the same corpus that exhibit different levels of degradation or on different pages of the same book.
However, despite these encouraging results, we encountered significant challenges with high and highly variable levels of bleed-through.
When bleed-through is particularly strong, its gray-level values can closely resemble those of the text, making it difficult, if not impossible, to distinguish it from the primary text without additional information. Specifically, although we know the true classification of each pixel during the training phase, a neural network trained on examples with similar features but different targets will produce random responses. This can lead to the unintended cancellation of true text where it overlaps with the opposite text (occlusion) or, conversely, the preservation of noisy bleed-through pixels, which are mistakenly recognized as text.
In addition to being intense, the level of ink penetration can exhibit significant spatial variability within the same manuscript due to localized factors, such as humidity. This may result in spurious spikes of extreme degradation within otherwise uniform areas of moderate degradation. The consequences again include the potential cancellation of true text or the erroneous preservation of bleed-through.
In a conference paper [16], we devised some modifications to the network architecture and learning to cope better with the above issues, and with occlusion in particular. We proposed adding an extra output class for overlapping text pixels in order to distinguish them from the ordinary foreground text pixels, and we explicitly included the conditions for occlusion to occur in the data model so that the construction of the training dataset was coherent with the inclusion of the new class. We discussed the improvement produced by these modifications from a qualitative point of view on both synthetic and real degraded manuscripts and compared the classification obtained by our NN with that obtained by state-of-the-art binarization algorithms.
In this paper, we further extend the work presented in [16]. First of all, we systematically assess the results of the method through a quantitative analysis on a popular dataset. Then, we try to improve the NN’s classification further by also taking into account the context of the pixel to be classified. Our point of view is that it may be reasonable to favor, a priori, the same classification for pixels that are spatially close to each other, since text images are locally homogeneous. More specifically, the classification of a pixel should also take into account its position within the text: in the center of a stroke, at its edge, or completely outside it. This information can partially be derived from the average characteristics of adjacent pixels, to be used as a further attribute of the current pixel. In the simplest way, each pixel can be thought of as connected to its neighborhood, which can predict or give insights into the pixel’s nature.
So far, our NN has worked pointwise and has considered the value of optical density as the primary attribute of the pixel so that we have two features corresponding to each pixel, namely the optical densities of the front and back observations of that pixel. In view of the above considerations, here, we try to account for the local information and extend the method by adding extra features, namely the two average values, one at the front and the other at the back, of the optical densities of 8-neighboring pixels. Our aim is to show that accounting for different contexts can explain how pixels with similar pointwise densities may come with different classifications, thus helping us to resolve ambiguity. The experiments confirm that the final NN architecture with four classes and four features is superior to the NN architectures we proposed previously.
This paper is organized as follows. In Section 2, “Materials and Methods”, we first describe the method adopted for the construction of the adaptive training set. This method exploits the idea of using a mathematical model to generate artificial examples by extending a data model we previously proposed for the peculiar degradation treated in this research. In this section, we also provide the operative details of the shallow NN architecture and the learning and recall phases. Section 3 analyzes and discusses the experimental results for both synthetic and real cases, showing the improvement that can be obtained using the proposed extended NN architecture. Finally, Section 4 concludes this paper.
2. Materials and Methods
The first step of the overall process of enhancing historical recto–verso manuscripts is to classify the pixels of each side into four different classes, which we call foreground, background, bleed-through, and occlusion, respectively. These classes represent the main text; the clean paper texture, possibly with other marks; the seeping ink; and the areas in which both sides have been written on and the two texts overlap. In the previous work [14], we only considered three classes by merging occlusion pixels with text pixels. This reflects the appearance of each side individually, where the occlusions are actually text for that side and, without knowledge of the opposite side, cannot be identified with certainty. As we will see in the experimental results, using only three classes resulted in an overestimation of the bleed-through class.
As a classifier, we propose a neural network (NN) that requires a training set with ground truths in order to learn to distinguish between pixels. Each pixel is described by four features: the two densities on the two sides of the manuscript and the two average densities of its 8-adjacent pixels, always on the two sides. In previous works on this subject, we only considered the two essential density values of a pixel. This left some ambiguities in the classification of pixels that, despite belonging to a homogeneous area, had anomalous density values, caused by large variability in the degradation or the inhomogeneity of the materials.
The overall workflow of the process of recto–verso manuscript enhancement is illustrated in Figure 1. In this diagram, the focus is on the recto side only.
2.1. Construction of the Training Set
As mentioned, for the training phase, we do not use an external dataset of similar manuscripts that have already been classified; instead, our neural network is trained using the manuscript we want to classify.
Thus, to build the training set, we select N pairs of patches containing clean text from the manuscript and then symmetrically mix them using a data model for seeping ink that describes the observed optical density of each side as the weighted sum of the ideal densities of the two sides.
We define the optical density of pixel $t$ as the minus log of the normalized intensity, i.e., $D(t) = -\log\left(s(t)/p\right)$, with $s(t)$ being the intensity and $p$ the mean intensity value of the paper support. This normalization allows the density to be independent of the color of the paper on the two sides. Based on this definition, the model is expressed in the following way:
where $x$ and $y$ indicate the two sides, which must be perfectly aligned after reflection of one of the two. Using Equation (1) for the opposite side, the roles of $x$ and $y$ are exchanged. In Equation (1), $\hat{D}$ and $D$ are the observed and ideal optical density, respectively, and ⊗ indicates the convolution between the ideal intensity $s$ and a Point Spread Function (PSF), $h$, describing the smearing of ink penetrating the paper. Finally, the space-variant quantities $q_x$ and $q_y$, whose allowed range is $[0, 1]$, represent the percentages of ink penetration from one side to the other. The first condition in Equation (1) means that we assume that the density of the foreground text does not increase due to ink seepage, as applies in the majority of cases.
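As a hedged sketch only, a two-condition form consistent with the description above can be written as follows. The notation is an assumption made for illustration: $\hat{D}_x$ denotes the observed density on side $x$, $q_y$ the penetration percentage of the ink coming from side $y$, the first condition is stated in words, and the convolution is applied here to the opposite-side ideal density.

```latex
\hat{D}_x(t) =
\begin{cases}
D_x(t), & \text{if } t \text{ is foreground text on side } x,\\[4pt]
D_x(t) + q_y(t)\,\left[h \otimes D_y\right](t), & \text{otherwise,}
\end{cases}
\qquad q_y(t) \in [0, 1].
```

Exchanging the roles of $x$ and $y$ gives the corresponding expression for the opposite side.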
In previous works (see, e.g., [10]), we neglected the ink saturation effect and proposed inverting the equation in the second condition of the model to virtually restore the recto–verso pair. To make the inversion possible, we assumed that the hyperparameters q and h were known in advance. Based on the observed densities of the two sides, we first inverted the model by assuming an identically zero ideal density on the opposite side, thus obtaining estimates of the ink penetration percentages for each pixel. The system could then be solved with respect to the ideal density maps, from which the virtually restored manuscript sides were obtained. To manage areas of text superposition (whose ideal density was not zero), the obtained images were corrected using some technicalities.
Here, we propose solving the direct problem of Equation (1) to generate the data necessary for the training set rather than solving the inverse problem to estimate the ideal densities, which are known in this case. Operatively, each patch out of the N selected pairs containing clean text is first binarized by the Sauvola algorithm in order to extract a map of the clean text and a map of the background. Comparing the binary maps of both members of the pair allows us to locate the four classes on each side, including the occlusion. Then, the original, non-binary pairs of patches are fed into the system in Equation (1) in a forward manner, potentially with different ink seepage percentages, so that we numerically generate synthetic samples of recto–verso text with bleed-through. The first condition in Equation (1) permits us to simulate the saturation of the ink; that is, when a pixel is foreground text on both sides, the value of the density is set to that of the recto pixel (or the verso pixel, respectively). For the generation of a single pair of patches, the model is taken as stationary, i.e., with a fixed percentage of ink seeping. However, the construction of several pairs with different percentage values means that, as a whole, samples of non-stationary degradation will be presented to the network.
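The generation of training samples described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names and numeric class codes are hypothetical, the binary text maps (obtained with the Sauvola algorithm in the text) are taken as given inputs, the PSF is omitted so that no ink smearing is simulated, and the opposite-side term is assumed to be the ideal density weighted by a stationary percentage q.

```python
import numpy as np

# Numeric class codes are an arbitrary choice made for this sketch.
BACKGROUND, FOREGROUND, BLEED_THROUGH, OCCLUSION = 0, 1, 2, 3


def label_classes(text_x, text_y):
    """Label side x from the binary text maps of side x and reflected side y."""
    text_x = np.asarray(text_x, dtype=bool)
    text_y = np.asarray(text_y, dtype=bool)
    labels = np.full(text_x.shape, BACKGROUND, dtype=int)
    labels[text_x & ~text_y] = FOREGROUND      # text on this side only
    labels[~text_x & text_y] = BLEED_THROUGH   # text on the opposite side only
    labels[text_x & text_y] = OCCLUSION        # text on both sides
    return labels


def mix_pair(d_x, d_y, text_x, text_y, q):
    """Simulate bleed-through on both sides with one fixed percentage q."""
    text_x = np.asarray(text_x, dtype=bool)
    text_y = np.asarray(text_y, dtype=bool)
    # Text pixels keep their own density (saturation); elsewhere a fraction q
    # of the opposite-side density is added. The PSF is omitted (assumption).
    obs_x = np.where(text_x, d_x, d_x + q * d_y)
    obs_y = np.where(text_y, d_y, d_y + q * d_x)
    return obs_x, obs_y


def build_samples(d_x, d_y, text_x, text_y, q_values):
    """Stack two-feature samples and side-x labels over several q values."""
    feats, labels = [], []
    for q in q_values:
        obs_x, obs_y = mix_pair(d_x, d_y, text_x, text_y, q)
        feats.append(np.stack([obs_x.ravel(), obs_y.ravel()], axis=1))
        labels.append(label_classes(text_x, text_y).ravel())
    return np.concatenate(feats), np.concatenate(labels)
```

Each call to `mix_pair` is stationary, while looping over several values of q in `build_samples` exposes the classifier to samples of varying degradation, in the spirit of the paragraph above.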
2.2. The Neural Network: Architecture, Learning, and Recall
We adopted a simple feedforward network with the architecture of a multilayer shallow neural network with one hidden layer of ten neurons and backpropagation training [15] (see Figure 2). To be specific, we used the function patternnet, available since the R2010b version of the MATLAB Deep Learning Toolbox. We ran it on the 2023a version of MATLAB. This network is a pattern recognition NN that can be trained to classify inputs according to target classes.
The network processes the two sides of the manuscript simultaneously, on a pixel-by-pixel basis. For each pixel, we consider the two density values on the two sides as features, plus the two average recto and verso values of the densities of the 8-surrounding pixels. As target classes, we consider the four different classes of background, foreground, bleed-through, and occlusion.
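The four features per pixel can be computed as in the following sketch. The function names are hypothetical, and the handling of image borders by edge replication is an assumption, since the text does not specify it.

```python
import numpy as np


def neighbor_mean(density):
    """Mean density of the 8 pixels surrounding each pixel.

    Borders are handled by replicating edge values (an assumption).
    """
    d = np.asarray(density, dtype=float)
    padded = np.pad(d, 1, mode="edge")
    rows, cols = d.shape
    total = np.zeros_like(d)
    for di in (0, 1, 2):
        for dj in (0, 1, 2):
            if (di, dj) != (1, 1):  # skip the central pixel itself
                total += padded[di:di + rows, dj:dj + cols]
    return total / 8.0


def pixel_features(d_recto, d_verso):
    """Return an (n_pixels, 4) array: two densities and two neighbor means."""
    columns = [
        np.asarray(d_recto, dtype=float),
        np.asarray(d_verso, dtype=float),
        neighbor_mean(d_recto),
        neighbor_mean(d_verso),
    ]
    return np.stack([c.ravel() for c in columns], axis=1)
```

The verso is assumed to be already reflected and aligned with the recto, so that the same array index addresses the same physical point on both sides.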
By construction, for the pairs of patches used to build the training set, we know the classification of each pixel on each side exactly. Thus, the target classes of the generated samples are directly available. The dataset is then randomly subdivided into a training set ( of pairs) and a validation set (the remaining ). As mentioned, the MATLAB patternnet network is used with a single hidden layer consisting of 10 nodes. We chose scaled conjugate gradient as the minimization algorithm (training function) and cross-entropy as the measure of network performance (performance function) during training. Tests performed with a higher number of neurons did not provide a significant improvement in the quality of the results.
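For readers without access to patternnet, a network of the same shape can be sketched in a few lines. This is only an illustration and not the toolbox function: the function names are hypothetical, the tanh hidden units and softmax output are choices of this sketch, and plain full-batch gradient descent is used in place of the scaled conjugate gradient algorithm. The learning rate, number of epochs, and initialization are likewise assumptions.

```python
import numpy as np


def train_shallow_net(X, y, n_classes, n_hidden=10, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer classifier with cross-entropy loss.

    Plain gradient descent is used here instead of scaled conjugate gradient.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, n_classes))
    b2 = np.zeros(n_classes)
    T = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                 # hidden activations
        Z = H @ W2 + b2
        Z -= Z.max(axis=1, keepdims=True)        # numerical stability
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)        # softmax probabilities
        G = (P - T) / n                          # gradient of mean cross-entropy
        GH = (G @ W2.T) * (1.0 - H ** 2)         # backpropagate through tanh
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ GH)
        b1 -= lr * GH.sum(axis=0)
    return W1, b1, W2, b2


def predict(params, X):
    """Return the most probable class index for each row of X."""
    W1, b1, W2, b2 = params
    return np.argmax(np.tanh(X @ W1 + b1) @ W2 + b2, axis=1)
```

In the recall phase, `predict` would be applied to the feature rows of all pixels and the result reshaped to the image size to obtain the class map.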
In the experiments, the number of patches N used to construct the dataset varied between 2 and 10; the size of the patches was chosen between and ; and the number of different values in for the ink seepage percentage ranged from 10 to 20. The architectural simplicity of the network ensures very short learning times. The typical learning times were in the order of a few seconds when using the parameters given.
From the output of the NN, which consists of the classification of each pixel as one of the four classes, the binarized version of the manuscript can be obtained immediately by merging the pixels classified as text and occlusion into the same class and bleed-through noise and background into another single class. When the goal is instead to obtain a virtually restored version of the manuscript in which its original appearance and informative features are preserved as much as possible, the foreground text pixels, the occlusion pixels, and the background pixels are given their original values, whereas the noisy pixels are replaced with samples drawn from the closest safe background region. For this latter task, we tested various state-of-the-art still image inpainting techniques and selected the exemplar-based image inpainting technique described in [17] as the best and simplest for our purposes.
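The two uses of the class map described above can be sketched as follows. The class codes and function names are hypothetical, and, as a deliberate simplification, a flat fill with the mean background value stands in for the exemplar-based inpainting of [17].

```python
import numpy as np

# Numeric class codes are an arbitrary choice made for this sketch.
BACKGROUND, FOREGROUND, BLEED_THROUGH, OCCLUSION = 0, 1, 2, 3


def binarize(class_map):
    """Return True where the pixel is text on this side (foreground or occlusion)."""
    return np.isin(class_map, (FOREGROUND, OCCLUSION))


def restore(image, class_map, fill_value=None):
    """Keep original values everywhere except at bleed-through pixels.

    Bleed-through pixels receive a flat fill value (a stand-in for
    inpainting); by default this is the mean of the background pixels.
    """
    out = np.array(image, dtype=float)
    class_map = np.asarray(class_map)
    if fill_value is None:
        fill_value = out[class_map == BACKGROUND].mean()
    out[class_map == BLEED_THROUGH] = fill_value
    return out
```

Because the class map is shared by the three RGB channels, `restore` could be applied to each channel separately to obtain a restored color image.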
3. Discussion of the Experimental Results
The results of manuscript enhancement using our self-trained NN approach were evaluated both qualitatively and quantitatively on degraded manuscripts contained in a popular dataset created within the Irish Script On Screen Project [18, 19]. This database can be downloaded from the website at [20] and contains 25 pairs of aligned recto–verso manuscript portions affected by bleed-through. For each pair, the corresponding ground truths are available, which are manually constructed binary texts cleaned of degradation. The ground truths serve as a comparison for evaluating the performance of classification/binarization algorithms and, indirectly, that of virtual restoration algorithms.
For the qualitative evaluation, we examined the accuracy of the virtual restoration from a perceptual point of view. For the quantitative evaluation, we compared the NN classification results with the ground truths. The standard error measures defined in [14] were used and plotted against the image index. Additionally, the means of Precision, Recall, and F-measure were computed and compared with those of state-of-the-art methods.
At this stage, we did not exploit color information, either in the training or in the classification phase. The manuscripts were converted into grayscale because we assumed that the information from the verso, even from a single channel, would be much richer than the information provided by extra observations at different wavelengths on the recto side alone. This does not mean that including the spectral diversity of the individual sides would not improve the performance of the method further. In fact, we plan to add color information in our future work. In any case, even if the classification is performed on grayscale images, the virtually restored versions of the color manuscripts can be recovered directly since the three RGB channels share the same classes.
We processed the degraded pairs provided in the dataset, but we also processed synthetic pairs generated from the ground truths with different levels of degradation, with the purpose of testing the robustness of the method under conditions with wide variability and extreme intensity of degradation.
For the qualitative evaluation, we show images of the results obtained on the ninth pair in the dataset under different network architectures. We chose this manuscript because its rendering was agreeable and illustrated our problem simply and clearly. For the quantitative evaluation, we show comparative plots of the reconstruction errors for all the pairs in the dataset, still for different network configurations.
Figure 3a,c show the recto and the reflected verso of the ninth pair with the real degradation that affected the two images.
Figure 3b,d show the manually built binary ground truths provided in the dataset. These ground truths represent the correct foreground texts on the two manuscript sides according to the perception of the human operator that constructed them.
3.1. From Three to Four Classes and from Two to Four Features: The Advantages of Distinguishing Occlusion Areas and Taking Context into Account
We processed all 25 RGB pairs in the dataset with NNs trained on clean patches extracted from the first pair in the dataset. We used three different network architectures: (i) three classes (without the occlusion class) and two features (pointwise information); (ii) four classes (adding the occlusion class) and two features (pointwise information); and (iii) four classes (with the occlusion class) and four features (local information, adding the average density values for the eight surrounding pixels on the two sides).
The training set was constructed by mixing the selected pairs of clean patches according to the data model in Equation (
1), using an ink penetration rate
q ranging from 0 to
.
Figure 4 shows a subset of the training set, consisting of the six examples generated by mixing a single pair among the eight used.
As highlighted several times, we could have used the widest range of possible interference levels, say
, to build maximally general NNs. However, it is obvious that a network focused on effective degradation is more efficient. We therefore roughly estimated the maximum value for
q for the entire dataset. In practice, we manually sampled a number of bleed-through pixels in the image for which we thought the interference was strong, then inverted Equation (
1) for
q based on the corresponding density values on the two sides, and finally averaged the obtained values.
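The rough estimate of the maximum penetration rate can be sketched as follows. This is an illustration under simplifying assumptions rather than the procedure actually coded: the function name is hypothetical, the PSF is ignored, and at a bleed-through pixel the ideal density of the side being examined is taken as zero, so that the observed density reduces to q times the opposite-side density and q is their ratio.

```python
import numpy as np


def estimate_max_q(obs_x, obs_y, sample_coords, eps=1e-6):
    """Average q = obs_x / obs_y over manually sampled bleed-through pixels.

    sample_coords is a list of (row, col) positions chosen by the operator
    where the interference on side x looks strong.
    """
    obs_x = np.asarray(obs_x, dtype=float)
    obs_y = np.asarray(obs_y, dtype=float)
    rows, cols = np.asarray(sample_coords).T
    ratios = obs_x[rows, cols] / np.maximum(obs_y[rows, cols], eps)
    # Percentages are constrained to the interval [0, 1].
    return float(np.clip(ratios, 0.0, 1.0).mean())
```

The value returned could then serve as the upper bound of the range of q used to generate the training set.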
An ink penetration rate between 0 and is a good representation of the average amount of degradation across the entire dataset. However, since the estimation was manual, some images exhibiting peaks of stronger degradation that were not represented in the training set occurred.
One solution would be to use a different NN for each image, basing the learning on the image itself. From a computational point of view, the learning phase is fast and has a minimal impact on the overall process, but the bottleneck is having to manually choose the patch pairs to use to generate the examples every time and carefully estimate the maximum amount of degradation.
In the following, we show the qualitative results of the binarization and the virtual restoration obtained on the ninth pair using the three different NNs (see Figure 5a,c,e). These results demonstrate that even with only three classes and two features, the NN is able to satisfactorily recognize most of the bleed-through pixels so that they can be removed (see Figure 5a,b). However, since the occlusion class is not considered, pixels that are text on both sides are normally classified as foreground on one side and bleed-through on the other, even with only negligible differences in their density. This ambiguity has the effect of producing small holes in the text characters. The introduction of the specific occlusion class is intended to eliminate this ambiguity by allowing pixels that have very similar and high densities to be classified as foreground on both sides. Indeed, already with four classes and two features, the number of holes in the characters decreases, as can be seen in Figure 5c,d. Moreover, increasing the number of features from two to four, which adds the smoothness constraint, is expected to make the resulting reconstruction smoother and flatter. Looking again at the characters in Figure 5e,f, this effect is clearly visible and corrects the errors that remained in the classification of the occlusions. In summary, when using the final NN architecture, the reduction in the internal erosion of characters is particularly evident, and the reconstructed text characters are almost perfectly complete and filled.
The binary versions (see Figure 5b,d,f) highlight the behaviors already observed for the virtually restored images.
Besides the visual, qualitative results, we also show a quantitative evaluation of the performance of the different NN architectures on the entire dataset of 25 recto–verso pairs.
Figure 6 shows the total errors in the binary reconstructions for the three different network architectures. Total error is measured according to the equations reported in [14], using the ground truths available in the dataset. The progressive improvement obtained by augmenting the number of classes and features used is apparent.
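We finally consider the more conventional Precision, Recall, and F-measure metrics, where Precision indicates the fraction of the detected foreground pixels that are correct, and Recall indicates the fraction of the correct foreground pixels that are detected. These metrics are defined as follows:
$$\mathrm{Precision} = \frac{|B_r \cap B_{gt}|}{|B_r|}, \qquad \mathrm{Recall} = \frac{|B_r \cap B_{gt}|}{|B_{gt}|}, \qquad \text{F-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where $B_r$ is the binary map of the foreground text in the restored image, $B_{gt}$ is the foreground text in the related binary ground truth mask, and $|\cdot|$ denotes the number of foreground pixels.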
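These three metrics can be computed from two binary maps as in the sketch below. The function name is hypothetical, and returning zero when a denominator vanishes is a convention adopted only for this example.

```python
import numpy as np


def precision_recall_f(restored_fg, truth_fg):
    """Return (precision, recall, f_measure) for two binary foreground maps."""
    a = np.asarray(restored_fg, dtype=bool)   # detected foreground
    b = np.asarray(truth_fg, dtype=bool)      # ground-truth foreground
    true_pos = np.logical_and(a, b).sum()
    precision = true_pos / a.sum() if a.sum() else 0.0
    recall = true_pos / b.sum() if b.sum() else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f_measure)
```

For a dataset of several pairs, the function would be called once per image and the three values averaged.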
We used the above measures to compare the final result of the NN architecture with state-of-the-art methods.
The values obtained are compared in Table 1 with those of the non-blind method [9] and the blind method [6] found in the respective papers. The proposed method exhibits a higher precision, which means that more of the pixels recognized as belonging to the foreground are correctly identified.
It is to be noted that while the ground truth mask is fixed, in general, different binarization algorithms can be used to extract the binary mask of the foreground text from the restored images, so the resulting metric values may be affected by this choice. For instance, in our method and in [6], the binary map of the restored foreground text is a preliminary result, whereas in [9], the restored image is binarized using the Gatos algorithm [21].
3.2. Robustness of the Method for Space-Varying and Strong Degradation
In a final experiment, we tested the robustness of the method with respect to the strength of the degradation and its high space variability. This experiment also allowed us to qualitatively verify the effectiveness of the degradation model adopted.
Still based on the ninth pair of images, we built an artificial clean recto–verso pair by placing the clean foreground text onto a textured background obtained by inpainting. The foreground text was obtained by selecting the RGB values for the real degraded images in Figure 3a,c at the positions of the black pixels in the corresponding binary ground truth maps (Figure 3b,d).
Figure 7a,b show this clean, ideal manuscript pair. An artificially degraded pair was then obtained by mixing the ideal pair using the data model in Equation (
1), where the percentage of penetrating ink was increased from
to
from left to right (
Figure 7c,d).
Note the plausibility of the visual aspect of the generated degraded images, which demonstrates the effectiveness of the degradation model adopted.
Figure 8 shows the results of applying the NNs with the three different architectures to the synthetically degraded images from
Figure 7c,d. The training set was constructed by selecting pairs of clean patches from the original ninth pair and mixing them, this time with percentages of ink penetration spanning from
to
, so as to cover the entire range of different amounts of degradation that was present in the generated degraded data.
The virtually restored recto for three classes and two features; four classes and two features; and four classes and four features is shown in
Figure 8a,c,e, respectively. Note how when there are only three classes, the reconstructed text sometimes appears corroded and fragmentary, with missing strokes. Indeed, because the degradation is very strong here, reaching up to
, the lack of a specific class for the occlusion causes text pixels on both sides to be attributed to the bleed-through class and then deleted. When the number of classes in the NN is extended to four, the text is reconstructed much better, and the characters are more complete and fuller. Note also the radical reduction in the remaining spikes in bleed-through. When the number of features is increased to four, the filling of the characters is even more apparent.
The corresponding binarizations are illustrated in Figure 8b,d,f and confirm the above considerations.
Looking at both real and synthetic experiments, it seems that the inclusion of the smoothness constraint, expressed by the third and fourth features, works much better in completing text characters than in reducing residual bleed-through noise, which is still present in the form of scattered and isolated spikes. We believe this depends on the extent of the neighborhood used, which is currently limited to immediately adjacent pixels only.
Overall, our results show that exploiting the information provided by the reverse side of the page can allow for satisfactory binarization even of extremely degraded manuscripts. Currently, to the best of our knowledge, no algorithm is able to handle such dramatic degradation using information from a single side alone. We already discussed this issue in [16], where the high-performance algorithm proposed in [22, 23] was used for comparison.
4. Conclusions
We demonstrated that joint processing of recto and verso pages of ancient manuscripts affected by ink penetration allows for significant improvements in terms of their binarization and virtual restoration thanks to the amount of information that the opposite side brings to the problem. This occurs without significant additional computational costs since scans of both sides of the sheet are normally available.
To this undoubted evidence, we have added the fact that the availability of an analytical data model describing degradation facilitates the use of NNs. Indeed, a data model allows us to artificially generate pairs of degraded recto–verso examples, whose targets are known by construction. In this way, it is possible to build a general training dataset without requiring real-world examples to be available.
We then trained a very simple shallow NN to correctly classify pixels as primary text, paper background, bleed-through noise, or overlaid text, using the data images themselves. After classification, the NN output can be used to binarize the foreground text or visually restore the manuscript so as to maintain both the fullness of the information content and the aesthetics of the original.
The method described here improves our previous proposals in two respects: (i) the correct classification of overlaps between the two texts (occlusion) by adding a specific class and (ii) the disambiguation of pixels of different natures but with similar densities, taking into account the characteristics of the adjacent pixels.
In our experiments, the four-class, four-feature network resulting from this proposal outperformed the previous architectures, both in terms of binarization and in terms of virtual restoration. This supports the view that classification methods exploiting spatial contiguity constraints are preferable to purely pointwise methods.
This method could be profitably applied in libraries and archives to process entire manuscript books or a large corpus of manuscripts of the same typology. Indeed, in these cases, digitization of the verso side is readily available, so no new acquisitions need to be performed. Furthermore, the pages will presumably be homogeneous in terms of character font, ink composition, and extent of damage, which makes it effective to use a single training set, i.e., a single NN, built on the basis of a single page, potentially chosen at random.
The open problems that we intend to study in the immediate future concern the following: (i) the automatic extraction of the patches used to construct the training examples; (ii) the introduction of color information and the extension of the method to multispectral observations; (iii) investigation of the generalization capabilities of the NN (e.g., with respect to overestimation of the degradation during the training phase); (iv) the introduction of further features to express other constraints or improve the constraints already used; and (v) the use of other, more sophisticated neural networks, since the data generation model adopted is independent of the neural network paradigm.