Learning-Free Text Line Segmentation for Historical Handwritten Documents

Abstract: We present a learning-free method for text line segmentation of historical handwritten document images. The method relies on automatic scale selection together with the second derivative of anisotropic Gaussian filters to detect the blob lines that strike through the text lines. The detected blob lines guide an energy minimization procedure that extracts the text lines. Historical handwritten documents contain noise, heterogeneous text line heights, skews, and touching characters among text lines. Automatic scale selection allows automatic adaptation to the heterogeneous nature of handwritten text lines, provided that the character height range is correctly estimated. In the extraction phase, the method can accurately split the touching characters among the text lines. We provide results investigating various settings and compare the method with recent learning-free and learning-based methods on the cBAD competition dataset.


Introduction
Digital handwritten documents are not easily explorable in their raw form but need to be transcribed further into machine readable text. Certainly, manual transcription of a large number of documents is not feasible in a reasonable time. Hence, there is a practical need for reliable handwritten document image processing algorithms. Text line segmentation is an essential operation and prerequisite for many document image analysis tasks. Advancement in text line segmentation performance will boost the performance of other tasks, such as word segmentation [1,2] and word recognition [3,4].
Text line segmentation consists of text line detection and text line extraction. Text line detection locates each text line by its baseline or x-height representation. Text line extraction in turn leads to a polygonal or pixel-level representation of text lines. The extraction-level representation is more precise and more useful for higher-level document image analysis tasks. With the advances in deep learning, numerous learning-based methods have been proposed for text line segmentation of handwritten documents. Learning-based methods [5][6][7][8] can inherently handle the problems arising from the complex layout of text lines and the heterogeneity of documents. Consequently, the recent competition datasets [9] (Figure 1) are more challenging than the prior ones [10][11][12][13].
In the last decade, several learning-free methods for text line extraction of handwritten documents have been proposed [14][15][16][17]. However, we observe only limited attempts where learning-free algorithms have been used for text line detection of challenging historical documents [9]. This is particularly interesting as learning-based algorithms require a vast amount of labeling effort. A learning-free algorithm for text line extraction is proposed in [17], where text lines are detected using multiscale second derivative of Gaussian filters and extracted using an energy minimization (EM) function. The present article extends this approach to detecting text lines of challenging historical documents. The new method proposes a robust character height estimation even in the presence of noise and chains of touching characters among multiple text lines. In addition, the extraction phase includes a mechanism to split consecutive touching text lines. Apart from these, we mathematically formulate automatic scale selection together with the second derivative of anisotropic Gaussians and thoroughly present and discuss the evaluation of parameters.

Related Work
Text line extraction approaches can be classified into three categories: top-down, bottom-up, and hybrid. Top-down approaches partition document images into text lines based on global features. Bottom-up approaches group pixels or components based on local features to form text lines. Hybrid approaches combine top-down and bottom-up techniques.

Top-Down Approaches
Top-down approaches are mainly based on projection profile, Hough transform, smearing, or seam carving. A projection profile sums pixel values along the horizontal axis for each y-axis value of a binary image to determine the locations of text lines.
Projection profile is commonly used for simple document images [18] but can also be adapted for gray scale images [19] and slightly skewed lines [20].
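As a minimal illustration of the projection profile idea, the following sketch sums the foreground pixels of each row of a toy binary image and reports the row bands whose sums reach a threshold. The function names, the `min_count` threshold, and the toy page are hypothetical and are not taken from the cited methods.

```python
def horizontal_projection_profile(binary_image):
    """Sum foreground pixels (1s) along each row of a binary image."""
    return [sum(row) for row in binary_image]


def text_line_bands(profile, min_count=1):
    """Return (start_row, end_row) pairs of consecutive rows whose
    profile value is at least min_count (a hypothetical threshold)."""
    bands, start = [], None
    for y, count in enumerate(profile):
        if count >= min_count and start is None:
            start = y                      # a text band begins
        elif count < min_count and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band reaches the bottom of the image
        bands.append((start, len(profile) - 1))
    return bands


page = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
profile = horizontal_projection_profile(page)   # [0, 4, 3, 0, 3]
bands = text_line_bands(profile)                # [(1, 2), (4, 4)]
```

Such a profile separates lines only when empty rows exist between them, which is why it is mostly suited to simple, unskewed pages.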
The Hough transform calculates parameters of linear structures produced by the text components in the document image. Shapiro et al. [21] used the Hough transform to find the global orientation of a document image and applied a projection profile along this orientation. The Hough transform can also be applied to the centroids of connected components to directly align them as text lines [22]. The Hough transform is robust in dealing with skewed straight lines but entails high computational cost.
Smearing fills the white space between consecutive black pixels along the same direction if their distance is within a predefined threshold [23]. Smearing is sensitive to overlapping text strokes; therefore, Shi et al. [24] smeared background pixels running through the overlapping strokes to build line separators. Later on, Alaei et al. [25] adapted smearing to skewed lines by applying it in a strip-wise fashion. The main difficulty of smearing is determining the optimum threshold, which is addressed in [26] by using steerable directional filters.
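The basic smearing rule can be sketched for a single image row as follows. This is a simplified illustration under the assumption that foreground pixels are encoded as 1 and background as 0; the function name `smear_row` is hypothetical and the sketch does not reproduce any of the cited variants.

```python
def smear_row(row, threshold):
    """Fill background runs (0s) lying between two foreground pixels (1s)
    when the run length is at most `threshold`."""
    out = list(row)
    last_fg = None                       # index of the most recent foreground pixel
    for x, value in enumerate(row):
        if value == 1:
            if last_fg is not None and 0 < x - last_fg - 1 <= threshold:
                for k in range(last_fg + 1, x):
                    out[k] = 1           # close the short gap
            last_fg = x
    return out


smeared = smear_row([1, 0, 0, 1, 0, 0, 0, 1, 0], threshold=2)
# The gap of length 2 is filled, the gap of length 3 is kept:
# [1, 1, 1, 1, 0, 0, 0, 1, 0]
```

The example makes the threshold sensitivity visible: raising `threshold` to 3 would also bridge the second gap.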
Seam carving computes the path of minimum energy cost from one end of the image to another. Saabni and El-Sana [27] employed medial seams for text line extraction of binary document images using signed distance transform as the energy map. Subsequently, they improved the method for gray scale documents using geodesic distance transform as energy map [16].
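The minimum-energy path that seam carving relies on can be computed by dynamic programming. The sketch below finds the cheapest left-to-right seam on a toy energy map, assuming the seam may move to the same or an adjacent row in each column; the function name and the energy values are hypothetical, and the distance-transform energy maps of [27] and [16] are not reproduced here.

```python
def min_energy_horizontal_seam(energy):
    """Return (total_cost, rows) of the cheapest left-to-right path, where
    the path may stay on its row or move to an adjacent row in each column."""
    h, w = len(energy), len(energy[0])
    cost = [[energy[y][0]] + [0] * (w - 1) for y in range(h)]
    back = [[0] * w for _ in range(h)]
    for x in range(1, w):
        for y in range(h):
            candidates = [(cost[py][x - 1], py)
                          for py in (y - 1, y, y + 1) if 0 <= py < h]
            best, prev = min(candidates)
            cost[y][x] = energy[y][x] + best
            back[y][x] = prev            # remember where the path came from
    end = min(range(h), key=lambda y: cost[y][w - 1])
    total = cost[end][w - 1]
    rows = [end]
    for x in range(w - 1, 0, -1):        # follow back-pointers to the left edge
        rows.append(back[rows[-1]][x])
    rows.reverse()
    return total, rows


energy = [
    [9, 9, 1, 1],
    [1, 1, 9, 9],
    [9, 9, 9, 9],
]
total, rows = min_energy_horizontal_seam(energy)   # (4, [1, 1, 0, 0])
```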

Bottom-Up Approaches
Top-down approaches process document images at the global level, which is problematic when the document does not have a Manhattan layout. Bottom-up approaches instead process document images at the local level and do not assume straight lines. They group elements into text lines; the elements can be pixels, super-pixels, or connected components. The drawback is that local elements must first be isolated, which is complicated for components that touch across consecutive lines. Bottom-up approaches are mainly based on clustering or classification.
Clustering algorithms group elements according to their features in an unsupervised manner [28]. They can be applied both to binarized document images [13,29] and to grayscale document images [30,31]. The above-mentioned works clustered super-pixels into words and locally joined them to form the text lines. Recently, Grüning et al. [6] clustered super-pixels into text lines in a greedy manner. As a result, this approach is applicable to different datasets without parameter tuning. Clustering algorithms are suitable for heterogeneous document collections; however, the number of clusters has to be selected carefully.
Classification algorithms classify the elements according to their features in a supervised manner. They are robust to noisy and transformed images but require a large amount of annotated data for training. Early methods used nonconvolutional classifiers with hand-crafted features [12,32,33]. Recent methods are inspired by convolutional neural networks, which have proven to be efficient. Moysset et al. [34] used a recurrent neural network to segment isolated paragraphs. Pastor et al. [35] used a convolutional neural network, first to classify page pixels as paragraph, and then to classify paragraph pixels as text line or non-text line. Text line extraction from a full page has been studied as a problem of predicting the bounding boxes around the text lines [36,37]. The sequential nature of these methods limits them to being used with sliding windows on horizontal text lines. Pixel classification in a sliding-window fashion is not desirable due to the redundant and expensive computation of overlapping areas in the sliding windows. As a remedy, dense prediction has been successfully used for text line segmentation of handwritten documents [5,8].

Hybrid Approaches
Hybrid approaches combine the strengths of top-down and bottom-up approaches while reducing the weaknesses of each. Fischer et al. [38] calculate the starting point and the skew of text lines with a projection profile and then use them to search for a piecewise linear separating path. Kumar et al. [39] globally estimate coarse text lines and locally reassign misclassified elements to split the touching text lines. Clausner et al. [40] focus on solving skewed and touching text lines using a combination of connected component clustering and projection profile analysis.

Scientific Background
In this section we present notations and definitions of scale-space representation with automatic scale selection, component tree, and energy minimization via graph cuts.

Scale-Space Representation with Automatic Scale Selection
The notion of scale is important when processing unknown image structures by automatic methods. This problem can be approached by representing image structures at different scales, so called scale-space representation [41]. However, scale-space representation does not address the problem of how to select local appropriate scales. Therefore, we use automatic scale selection [42] that adapts to the local scales of image structures.
Scale-space representation together with automatic scale selection applies to a large class of differential image descriptors. We adapt it to detect blobs, that is, regions brighter than their surroundings, formed from text lines, and term this procedure blob line detection with automatic scale selection.
Given $n$ scale values $(\sigma_1, \sigma_2, \dots, \sigma_n)$, the scale-space representation is a stack of $n$ layers, where the $i$th layer is the convolution of the image with the Laplacian at scale $\sigma_i$ (Figure 2). Spatially, the Laplacian response is maximal at the center of a blob when the scale matches the scale of the text line. However, the Laplacian response decreases as the scale $\sigma$ increases. To eliminate this decay, the Laplacian response is multiplied by $\sigma^2$; the result is called the scale-normalized Laplacian. Automatic scale selection then assigns the maximum scale-normalized convolution response to each pixel of the image. Normally, the differential descriptor for forming a blob is the Laplacian of an isotropic Gaussian (LoG) [42]. However, we define a scale-space representation using the Laplacian of anisotropic Gaussians, because a text line is elongated and has different scales along the two coordinate directions. In the following sections we give the formulas for the scale-space of the Laplacian of anisotropic Gaussians and for automatic scale selection using the scale-normalized Laplacian.
Figure 2. Scale-space representation with automatic scale selection is achieved by convolving the image with different scales $(\sigma_1, \sigma_2, \dots, \sigma_n)$ and assigning the maximum scale-normalized convolution response to each pixel of the image.

Scale-Space of Anisotropic Gaussians
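Given a continuous image $I : \mathbb{R}^2 \to \mathbb{R}$, its scale-space of anisotropic Gaussians $L : \mathbb{R}^2 \times \mathbb{R}_+^2 \to \mathbb{R}$ is defined by the following convolution:
$$L(x, y; \sigma_x, \sigma_y) = g(x, y; \sigma_x, \sigma_y) * I(x, y), \tag{1}$$
where $g : \mathbb{R}^2 \times \mathbb{R}_+^2 \to \mathbb{R}$ denotes the anisotropic Gaussian kernel
$$g(x, y; \sigma_x, \sigma_y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left(-\left(\frac{x^2}{2\sigma_x^2} + \frac{y^2}{2\sigma_y^2}\right)\right). \tag{2}$$
In this representation, $*$ is the convolution operator, $\sigma_x$ is the scale parameter in the horizontal direction, and $\sigma_y$ is the scale parameter in the vertical direction.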

Scale-Space of Laplacian of Anisotropic Gaussians
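The scale-space of the Laplacian of anisotropic Gaussians is computed by differentiating the scale-space of anisotropic Gaussians twice with respect to $x$ and twice with respect to $y$ or, equivalently, by convolving the image with the Laplacian of anisotropic Gaussians:
$$\nabla^2 L = L_{xx} + L_{yy} = (g_{xx} + g_{yy}) * I. \tag{3}$$
Given the anisotropic Gaussian in Equation (2), its Laplacian is defined by
$$\nabla^2 g = g_{xx} + g_{yy},$$
where
$$g_{xx} = \left(\frac{x^2}{\sigma_x^4} - \frac{1}{\sigma_x^2}\right) g(x, y; \sigma_x, \sigma_y)$$
and
$$g_{yy} = \left(\frac{y^2}{\sigma_y^4} - \frac{1}{\sigma_y^2}\right) g(x, y; \sigma_x, \sigma_y).$$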
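The Laplacian of an anisotropic Gaussian can be evaluated pointwise in a few lines. The sketch below assumes the standard zero-mean, axis-aligned anisotropic Gaussian with horizontal scale `sx` and vertical scale `sy` and uses its analytic second derivatives; the function names are hypothetical.

```python
import math


def aniso_gaussian(x, y, sx, sy):
    """Zero-mean, axis-aligned anisotropic Gaussian with scales sx and sy."""
    norm = 1.0 / (2.0 * math.pi * sx * sy)
    return norm * math.exp(-(x * x / (2 * sx * sx) + y * y / (2 * sy * sy)))


def aniso_log(x, y, sx, sy):
    """Laplacian g_xx + g_yy of the anisotropic Gaussian at (x, y)."""
    g = aniso_gaussian(x, y, sx, sy)
    g_xx = (x * x / sx ** 4 - 1.0 / sx ** 2) * g   # second derivative in x
    g_yy = (y * y / sy ** 4 - 1.0 / sy ** 2) * g   # second derivative in y
    return g_xx + g_yy


# At the kernel center the Laplacian is negative, so a bright blob yields a
# negative response unless the sign is flipped.
center_value = aniso_log(0.0, 0.0, sx=4.0, sy=2.0)
```

Sampling `aniso_log` on a pixel grid would give a discrete filter kernel; the truncation radius of such a kernel is a design choice that the text does not specify.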

Automatic Scale Selection
In the scale-space representation, the amplitude of the Laplacian in Equation (3) decreases with scale. Based on this phenomenon, automatic scale selection states that local extrema over scales of the scale-normalized Laplacian $L_{\sigma\text{-norm}}$ correspond to significant structures, that is, regions brighter than their surroundings.
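To make the selection rule concrete, the following sketch applies it to a one-dimensional signal with an isotropic Gaussian, which is a deliberate simplification of the two-dimensional anisotropic case described here. The response at scale sigma is multiplied by sigma squared, and the sign is flipped so that bright bumps respond positively; the function names and the toy signal are hypothetical.

```python
import math


def gauss_dd(t, sigma):
    """Second derivative of a 1D zero-mean Gaussian at offset t."""
    g = math.exp(-t * t / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
    return (t * t / sigma ** 4 - 1.0 / sigma ** 2) * g


def scale_selected_response(signal, sigmas):
    """For each sample, return (best_response, best_sigma) over all scales of
    the scale-normalized response sigma^2 * (-(g'' * signal))."""
    n = len(signal)
    result = []
    for c in range(n):
        best = None
        for s in sigmas:
            conv = sum(signal[i] * gauss_dd(c - i, s) for i in range(n))
            response = -(s * s) * conv   # sign flipped: bright bumps are positive
            if best is None or response > best[0]:
                best = (response, s)
        result.append(best)
    return result


signal = [0.0] * 101
for i in range(44, 57):                  # bright bump, 13 samples wide, centered at 50
    signal[i] = 1.0
response, sigma = scale_selected_response(signal, [2, 4, 6, 8, 10])[50]
# For this toy bump the selected scale at the center is sigma == 6,
# close to the half-width of the bump.
```

Without the sigma-squared factor the smallest scales would tend to dominate, because the raw second-derivative amplitude shrinks as the scale grows.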

Component Tree
A component tree [43] organizes the connected components of level sets in a tree structure. Let $C_t$ be the set of connected components obtained by thresholding with threshold $t$. The nodes in a component tree correspond to the components in $C_t$ for varying values of the threshold $t$. The root of the tree is the single member of $C_{t_{\min}}$, where $t_{\min}$ is chosen such that $|C_{t_{\min}}| = 1$. Levels in the tree correspond to $C_{t_{\min} + d}$, where $d$ is a parameter that determines the step size for the tree. We use $d = 1$ in all experiments. There is an edge between $C_i \in C_t$ and $C_j \in C_{t+1}$ if and only if $C_j \subseteq C_i$. The maximal threshold $t_{\max}$ used in the tree construction is simply the maximal value in the map. Figure 3 illustrates the above definitions.
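The definitions above can be sketched on a tiny grayscale map. The code assumes that thresholding keeps pixels with value at least t and that components are 4-connected; both choices, as well as the function names, are assumptions made for illustration and are not stated in [43] as cited here.

```python
from collections import deque


def components_at(image, t):
    """4-connected components of pixels with value >= t, as frozensets of (y, x)."""
    h, w = len(image), len(image[0])
    seen, comps = set(), []
    for y in range(h):
        for x in range(w):
            if image[y][x] >= t and (y, x) not in seen:
                comp, queue = set(), deque([(y, x)])
                seen.add((y, x))
                while queue:                       # flood fill from the seed pixel
                    cy, cx = queue.popleft()
                    comp.add((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and image[ny][nx] >= t and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                comps.append(frozenset(comp))
    return comps


def component_tree_edges(image, t_min, t_max):
    """Edges (parent, child) between components at thresholds t and t + 1."""
    edges = []
    for t in range(t_min, t_max):
        parents = components_at(image, t)
        for child in components_at(image, t + 1):
            for parent in parents:
                if child <= parent:                # child is a subset of parent
                    edges.append((parent, child))
    return edges


image = [
    [1, 1, 1, 1, 1],
    [1, 3, 1, 2, 1],
    [1, 1, 1, 1, 1],
]
edges = component_tree_edges(image, t_min=1, t_max=3)
```

In this toy map the root at threshold 1 covers the whole image, it splits into two components at threshold 2, and only one of them survives at threshold 3.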

Method
The proposed approach uses scale-space representation with automatic scale selection to detect blob lines that strike through the text lines. The detected blob lines are then binarized by the component tree algorithm. Finally, energy minimization via graph cuts extracts the text lines with the help of the detected blob lines.

Blob Line Detection Using Automatic Scale Selection
We informally define a blob to be a connected region that is significantly brighter than its neighborhood. Text line detection aims to derive the blob lines that strike through the text lines. These blob lines are derived by convolving text lines with the Laplacian of anisotropic Gaussians in a range of scales corresponding to the height range of the characters (Equation (5)).
The character height range $\sigma_x$ is computed automatically in one of two ways: (1) the Component Evolution Map (CEM) [44] estimates the character height range by analyzing the height distribution of connected components for each possible grayscale threshold; (2) the mean height of components estimates the character height range as $[\mu, \mu + \sigma]$, where $\mu$ and $\sigma$ are the mean and standard deviation of the component heights in the document.
These Gaussian filters are elongated along the horizontal direction: for every value of the vertical scale, the horizontal scale is set to $e$ times that value, where $e$ is the elongation rate; its optimal value is determined experimentally in Section 6.2.6. Then, for each pixel, we choose the strongest response of the scale-normalized Laplacian given by Equation (8). We investigate the effectiveness of horizontally elongated filters on text lines with varying inclination. The results suggest that this approach is effective in detecting almost horizontal text lines (Figure 4).
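The second estimate, the range from the mean to the mean plus one standard deviation of the component heights, can be computed as in the following sketch. Whether the population or the sample standard deviation is meant is not stated in the text; the population form is assumed here, and the function name and heights are hypothetical.

```python
from statistics import mean, pstdev


def character_height_range(heights):
    """Return (mu, mu + sigma) for a list of connected component heights."""
    mu = mean(heights)
    sigma = pstdev(heights)      # population standard deviation (an assumption)
    return mu, mu + sigma


low, high = character_height_range([20, 22, 24, 26, 28])
```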

Figure 4. Panels: rotated image, grayscale blobs, and binary blobs.

Blob Line Binarization
The blob lines detected by automatic scale selection are represented by a grayscale image, as illustrated in Figure 5a. We use the component tree algorithm to obtain the binary blob lines that strike through the text lines.

Component Tree
The component tree [43] binarizes the grayscale blob line image by fitting a least-squares linear spline with k knots to each candidate blob line. A candidate blob line is labeled valid if it satisfies both fitting scores, Fitting Score 1 (FS1) and Fitting Score 2 (FS2). FS1 requires the average 1-norm distance of the blob pixels from the spline to be less than the maximum character height. This condition eliminates fat blobs that include two consecutive lines. FS2 requires the ratio of the blob area to the sum of the distances of the contour pixels from the spline to be less than 0.9. This condition eliminates nonconvex blobs with a partial merge (Figure 5c). We traverse the component tree in a breadth-first manner with d = 1. At each node, if the connected component is valid, it is taken as a binary blob line, and the search along this branch is complete. Otherwise, the component is refined by recursively processing the children of the node. Figure 5 illustrates the component thresholding procedure, which stops at valid blobs.
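The traversal described above can be sketched independently of the fitting scores. In the following illustration the tree is a plain dictionary from a node to its children and validity is an arbitrary predicate standing in for FS1 and FS2; the node names and the function `select_valid_blobs` are hypothetical.

```python
from collections import deque


def select_valid_blobs(root, children, is_valid):
    """Breadth-first search that keeps the first valid node on every branch."""
    selected, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if is_valid(node):
            selected.append(node)                   # accept; do not descend further
        else:
            queue.extend(children.get(node, []))    # refine via the children
    return selected


# Toy tree: the root blob covers the page, "fat" covers two merged lines.
children = {"page": ["fat", "line3"], "fat": ["line1", "line2"]}
valid = {"line1", "line2", "line3"}
blobs = select_valid_blobs("page", children, lambda node: node in valid)
# blobs == ["line3", "line1", "line2"]
```

An invalid leaf is simply dropped in this sketch, since it has no children left to refine.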

Text Line Extraction with Energy Minimization Using Graph Cuts
Binary blob lines detected in Section 4.2 constitute the output of the text line detection phase. The extraction-level representation further requires assigning a text line label to each pixel in the document image. We use energy minimization [45] to assign connected components to text line labels with the help of the detected blob lines. The energy favors assigning each component to the label of its closest blob line, while encouraging nearby components to share the same label and discouraging the assignment of any component to spurious blob lines. Let $L$ be the set of binary blob lines and $C$ be the set of connected components in the binary document image. Energy minimization finds a labeling $f$ that assigns each component $c \in C$ to a label $\ell_c \in L$ such that the energy function $E(f)$ is minimized.
The energy function has three terms: (1) a data cost, which penalizes assigning a component to a distant blob line; (2) a smoothness cost, which penalizes assigning nearby components to different labels; and (3) a label cost. In the smoothness cost, $\langle \cdot \rangle$ denotes expectation over all pairs of adjacent elements [46], and $\delta(\cdot)$ is 1 if the condition inside the parentheses holds, and 0 otherwise. For the label cost, for every blob line $\ell \in L$, $h_\ell$ is defined as $\exp(2 \cdot r_\ell)$, where $r_\ell$ is the normalized number of foreground pixels overlapping with blob line $\ell$. The higher the label cost, the higher the probability of discarding the blob line as a spurious line.
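The structure of such an energy can be sketched as follows. Only the label cost, exp(2 r), is taken from the text; the data costs and neighbor weights are passed in as precomputed tables because their exact definitions are not reproduced here, and charging the label cost only for labels that are actually used is an assumption borrowed from common label-cost formulations. All names and numbers are hypothetical.

```python
import math


def label_cost(overlap_ratio):
    """Label cost of a blob line, exp(2 * r), with r its overlap ratio."""
    return math.exp(2.0 * overlap_ratio)


def energy(labeling, data_cost, neighbor_weights, overlap_ratio):
    """Sum of data, smoothness, and label costs for a component labeling."""
    data = sum(data_cost[c][l] for c, l in labeling.items())
    smooth = sum(w for (a, b), w in neighbor_weights.items()
                 if labeling[a] != labeling[b])      # pay when neighbors disagree
    used = set(labeling.values())
    labels = sum(label_cost(overlap_ratio[l]) for l in used)
    return data + smooth + labels


data_cost = {"a": {0: 1.0, 1: 4.0}, "b": {0: 3.0, 1: 2.0}}
neighbor_weights = {("a", "b"): 0.5}
overlap_ratio = {0: 0.0, 1: 0.5}
merged = energy({"a": 0, "b": 0}, data_cost, neighbor_weights, overlap_ratio)
split = energy({"a": 0, "b": 1}, data_cost, neighbor_weights, overlap_ratio)
```

In this toy case the labeling that uses a single blob line has the lower energy, even though component b is nominally closer to the second line. A graph-cut solver would search over all labelings rather than compare two by hand.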

Merging Broken Blob Lines
Once the EM removes the spurious blob lines, broken blob lines may still remain. Given the binary image of blob lines, for each blob line we extract its left and right endpoints and define its direction as the vector connecting the left endpoint to the right endpoint. Two adjacent blob lines are merged if (1) the direction of the vector connecting the right endpoint of the first blob line to the left endpoint of the second falls between the directions of the two blob lines, and (2) their vertical distance is less than the maximum character height.
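The two merging conditions can be sketched with endpoint coordinates only. The code assumes that each blob line is given by its left and right endpoints as (x, y) pairs, that directions are compared as angles of nearly horizontal vectors (so no angle wrap-around is handled), and that the vertical distance is measured between the facing endpoints; these are assumptions, and the function names are hypothetical.

```python
import math


def direction(p, q):
    """Angle of the vector from point p to point q, both given as (x, y)."""
    return math.atan2(q[1] - p[1], q[0] - p[0])


def should_merge(first, second, max_char_height):
    """first, second: ((x_left, y_left), (x_right, y_right)); first lies to the left."""
    d1 = direction(*first)
    d2 = direction(*second)
    link = direction(first[1], second[0])            # right end of first to left end of second
    between = min(d1, d2) <= link <= max(d1, d2)
    close = abs(second[0][1] - first[1][1]) < max_char_height
    return between and close


same_line = should_merge(((0, 10), (100, 12)), ((120, 12.5), (220, 16)), max_char_height=20)
other_line = should_merge(((0, 10), (100, 12)), ((120, 60), (220, 62)), max_char_height=20)
```

Here `same_line` is true because the connecting vector continues the gentle slope of both fragments, while `other_line` is false because the second fragment lies far below the first.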

Splitting Touching Characters
Text line extraction results contain unsegmented touching characters from adjacent text lines. We therefore check whether a connected component c overlaps more than one blob line; if it does, we relabel c by assigning each pixel in c to the label of the nearest blob line.
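The relabeling step can be sketched as a nearest-blob-line assignment per pixel. The representation of blob lines as lists of pixel coordinates, the use of Euclidean distance, and the function name are assumptions made for this illustration.

```python
def split_component(pixels, blob_lines):
    """pixels: list of (y, x); blob_lines: dict label -> list of (y, x).
    Returns dict label -> list of component pixels nearest to that blob line."""
    def sq_dist_to_line(p, line):
        return min((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for q in line)

    parts = {label: [] for label in blob_lines}
    for p in pixels:
        nearest = min(blob_lines, key=lambda label: sq_dist_to_line(p, blob_lines[label]))
        parts[nearest].append(p)
    return parts


# Two horizontal blob lines and one vertical stroke touching both text lines.
blob_lines = {1: [(10, x) for x in range(5)], 2: [(31, x) for x in range(5)]}
stroke = [(y, 2) for y in range(8, 34)]
parts = split_component(stroke, blob_lines)
# Rows 8..20 go to blob line 1, rows 21..33 to blob line 2.
```

The cut falls halfway between the two blob lines regardless of the stroke shape, which is consistent with the coarse splitting noted in the results.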

Evaluation
This paper proposes a learning-free text line segmentation method for challenging historical handwritten documents such as the cBAD dataset (https://zenodo.org/record/835441). In addition, we evaluated the method on another recent handwritten text line segmentation dataset, DIVA-HisDB (http://diuf.unifr.ch/hisdoc/diva-hisdb). For each dataset, we use only its test set because our method is not learning-based. The proposed method outputs pixel labels, which are converted to match the ground truth definition of each dataset.

cBAD Dataset
The cBAD dataset [9] contains 539 document images from 7 different archives. Its ground truth is defined as baselines, whereas our method extracts pixel labels. To find baselines from pixel labels, we first compute tight polygons around the pixel labels of text lines. Then, for each bounding polygon we take its lower contour points and iteratively fit a regression line through these points while excluding the outliers (Figure 6). Using the extracted baselines, the performance is measured by means of Precision (P), Recall (R), and F-measure (FM), as described in [9].
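The iterative baseline fit can be sketched with ordinary least squares. The outlier rule used below, dropping points whose residual exceeds `tol` times the root-mean-square residual, is an assumption, since the text does not state the exclusion criterion; the function names and sample points are hypothetical.

```python
def fit_line(points):
    """Least-squares line y = slope * x + intercept through (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx                    # assumes the x values are not all equal
    return slope, my - slope * mx


def robust_baseline(points, tol=2.0, max_iter=10):
    """Refit after dropping points whose residual exceeds tol * RMS residual."""
    pts = list(points)
    for _ in range(max_iter):
        slope, intercept = fit_line(pts)
        residuals = [abs(y - (slope * x + intercept)) for x, y in pts]
        rms = (sum(r * r for r in residuals) / len(pts)) ** 0.5
        kept = [p for p, r in zip(pts, residuals) if r <= tol * rms]
        if len(kept) == len(pts) or len(kept) < 2:
            break                        # nothing removed, or too few points left
        pts = kept
    return fit_line(pts)


# Lower contour points of a flat text line with one descender at x = 5.
contour = [(x, 50.0) for x in range(10)]
contour[5] = (5, 70.0)
slope, intercept = robust_baseline(contour)
```

For these points the descender is removed in the first pass and the refit recovers the flat baseline at y = 50.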

DIVA-HisDB Dataset
The DIVA-HisDB dataset [47] contains 30 pages from 3 medieval manuscripts. Its ground truth is defined as bounding polygons. Therefore, we compute tight polygons around the pixel labels of our output and measure the performance by means of the Intersection over Union (IU), as described in [47].
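For reference, a generic Intersection over Union on pixel sets can be written as below. This is only the basic set formula, not the evaluation tool of [47], whose line-level and pixel-level matching rules are not reproduced; the function name and the empty-set convention are assumptions.

```python
def intersection_over_union(predicted, truth):
    """IoU of two collections of (y, x) pixel coordinates."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    # Convention (an assumption): two empty regions count as a perfect match.
    return len(predicted & truth) / len(union) if union else 1.0


iu = intersection_over_union(
    [(0, 0), (0, 1), (0, 2)],
    [(0, 1), (0, 2), (0, 3)],
)   # 2 shared pixels out of 4 in the union: 0.5
```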

Experimental Setting
In this section we conduct ablation studies for adapting the learning-free algorithm proposed in [17] to challenging historical handwritten documents. The default parameters of the algorithm proposed in [17] are as follows:
1. Estimating the character height range using CEM, as described in Section 4.1.
2. Binarizing blob lines using the component tree with both fitting scores given in Section 4.2.1 and k = 20 knots.
3. Merging broken blob lines, as described in Section 4.4.
4. Binarizing the document image using Otsu thresholding, to be used with the EM function.
The cBAD dataset is relatively large, and the proposed algorithm takes around one day to run on the whole dataset. For this reason we randomly sample a representative subset of the cBAD dataset, one page from each collection, and conduct the ablation studies on this subset.

Text Line Detection Experiments
The text line detection stage is based on automatic scale selection: the document image is convolved with second derivative of anisotropic Gaussian filters over a range of scales estimated as the character height range of the text in the document. The output of this filtering is a response map, a grayscale image of the detected blobs. This grayscale image is binarized to obtain the binary blob lines that strike through the text lines.

Effect of Character Height Range
The character height range used with automatic scale selection inevitably affects the detected blob lines. Cohen et al. [17] use CEM for character height range estimation, but many touching text lines mislead the algorithm. To tackle this problem, we experiment with the mean height of connected components (Section 4.1). Because noise and long chains of touching text lines are also misleading, we additionally use a refined mean estimation that excludes components larger than a maximum threshold (100 pixels) or smaller than a minimum threshold (10 pixels). Considering that the text is mostly located in the center region of a document, we also exclude the outer 2% of the document image. We found the results to be less than ideal and therefore also experimented with half of the above range estimates. Figure 7 shows that there are no sharp changes between the results of the different height estimations; this holds for such a sample document without severe noise. As can be seen, the half values of the range provide thinner blob lines that are further apart, while provoking spurious blob lines. Qualitative analysis revealed a considerably higher performance and lower run time for the half-refined mean estimation of the character height range (Figure 8).
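The refined and halved estimate can be sketched as follows. The sketch assumes that components are given as bounding boxes, that the size thresholds apply to the box height, that the outer 2% refers to a margin of 2% of each page dimension, and that halving applies to both ends of the range; all of these, together with the function name, are assumptions.

```python
from statistics import mean, pstdev


def refined_half_height_range(components, page_height, page_width,
                              min_h=10, max_h=100, margin=0.02):
    """components: list of (top, left, height, width) bounding boxes in pixels."""
    y0, y1 = margin * page_height, (1 - margin) * page_height
    x0, x1 = margin * page_width, (1 - margin) * page_width
    heights = [
        h for (top, left, h, w) in components
        if min_h <= h <= max_h                     # drop noise and long touching chains
        and y0 <= top and top + h <= y1            # drop boxes in the outer border
        and x0 <= left and left + w <= x1
    ]
    mu, sigma = mean(heights), pstdev(heights)     # raises if no component survives
    return mu / 2, (mu + sigma) / 2                # half of the range [mu, mu + sigma]


components = [
    (100, 100, 20, 15),    # regular character
    (100, 200, 30, 15),    # regular character
    (300, 100, 4, 4),      # noise speck, too small
    (400, 100, 250, 40),   # chain of touching lines, too large
    (0, 0, 25, 10),        # artifact on the page border
]
low, high = refined_half_height_range(components, page_height=1000, page_width=800)
# Only the two regular characters survive: low == 12.5, high == 15.0
```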

Effect of Merging Blob Lines
When there is a large gap between words, a blob line can become disconnected. In this experiment we tested to what extent merging the blob lines improves the results. We found that merging slightly improves the performance (Table 1).

Effect of Blob Line Fitting Scores
The quality of the detected blob lines is a key factor that influences the final performance. A blob line should be as plain, continuous, and well separated as possible. Cohen et al. [17] define this preference by two fitting scores (Section 4.2.1). The results using only FS1 provide compelling evidence that nonconvex blobs with a partial merge are negligible (Table 2). The results in Section 6.2.3 indicate that using only FS1 is reasonable. FS1 eliminates the fat blobs that include two consecutive lines. Cohen et al. [17] require FS1 to be less than the upper character height. Relaxing this condition decreases the run time, whereas tightening it increases the run time without an evident performance gain (Figure 9).

Effect of Number of Knots
Splines are used for their ability to fit curved blob lines as well as straight ones. Cohen et al. [17] use 20 knots to fit the linear splines. We investigate different numbers of knots and show their effect on the method's performance. Results are presented in Figure 10. As expected, varying the number of knots does not impact the performance measure, because almost all text lines in the evaluation datasets are horizontal. However, the spline fitting makes the proposed method suitable for curved text lines [48]. It is apparent that recall values follow a higher trend than the precision values, possibly due to the spurious blob lines that could not be eliminated by the label cost because of crowded ascenders and descenders overlapping these spurious lines.

Effect of Elongation Rate
The proposed method assumes that the text lines are almost horizontal. Therefore, convolving a text line with the second derivative of an anisotropic Gaussian elongated along the horizontal direction generates a blob line that strikes through the text line. Figure 11 shows the blob lines resulting from two marginal cases: e = 1 and e = 7. There is a trade-off between the marginal cases. As e decreases, the blob lines become tight-fitting; as e increases, they become loose-fitting. Tight-fitting misses the spatial context and may consider two touching text lines as one (Figure 11b). Loose-fitting grasps unrelated spatial context and may generate spurious blob lines (Figure 11f). From the performed experiments, the optimal point seems to be e = 2 (Table 3).
Figure 11. As e decreases, the spatial context is missed and two touching text lines are considered as one (b). As e increases, unrelated spatial context is grasped and a spurious blob line is generated (f).
We have demonstrated that alternative parameter values for text line detection produce only slightly different results, with the exception of the character height estimation method. Half values of the refined mean of the connected components achieve superior performance over the default character height estimation method, CEM.

Text Line Extraction Experiments
In all the text line extraction experiments, we use half values of the refined mean for character height estimation. We use FS1 with a threshold of t = 1.1 and merge the binarized blob lines. We set the number of knots to k = 20 and the elongation rate to e = 2.
The energy function is formulated in the binary image domain; thus, any noise pixels remaining from the binarization process degrade the accuracy. Therefore, we examined the effect of two binarization methods, Otsu [49] and Bar-Yosef et al. [50]. Both binarization results suffer either from losing parts of text lines or from noisy parts of text lines. However, the performance is not sensitive to the binarization method (Table 4). On the other hand, splitting the touching characters between two text lines leads to a significant improvement (Table 4).

Results
We present results on two recent historical handwritten datasets, DIVA-HisDB and cBAD. For text line detection, we use half of the refined mean for character height estimation, use FS1 with a threshold of t = 1.1, merge the binarized blob lines, and set the number of knots to k = 20 and the elongation rate to e = 2. For text line extraction, we use Bar-Yosef et al. binarization and split the touching characters among text lines.
DIVA-HisDB is a historical document dataset with a high number of consecutive touching text lines, and it is heterogeneous in terms of text line height. The input to our algorithm is the ground truth of the layout analysis task of the competition. Table 5 shows our results compared to two learning-free algorithms. One is the best performing algorithm in the ICDAR2017 Competition on Layout Analysis for Challenging Medieval Manuscripts [47]. The other [51] is an improved version of the best performing algorithm in the competition. The proposed algorithm can reach a line IU of 100%. The degradation in pixel IU mostly comes from the coarse splitting of touching characters among text lines. The method in [51] performs perfectly in terms of the overall line IU, at 100%. The advantage of the proposed algorithm is that it can detect text lines irrespective of touching components across the text lines and is robust to complex layouts. However, its performance is highly dependent on the correct estimation of the average character height (Figure 8). In contrast, the desirable property of the method proposed in [51] is that its results are fairly stable with respect to varying parameter values. However, that method is not appropriate for complex layouts with heterogeneous text line lengths or separate text line groups, because it labels the text lines with the intuition that all components belonging to the same text line have the same number of seams below them.
cBAD is a historical document dataset with a larger evaluation set; it contains documents with varying layouts, originating from different time periods and locations. It is heterogeneous in terms of text line height and length. Input documents are grayscale images. Some examples of pages with extracted bounding polygons can be seen in Figure 12. Common error cases arise from a group of touching characters that misleads the algorithm into merging the text lines (Figure 12b). Another type of error occurs due to the artifacts in the frame of the page images.
We compare our results with the results obtained in the cBAD competition. Table 6 shows the precision, recall, and F-measure values of the participants. The methods are grouped as learning-based and learning-free. The best performance, with an F-measure of 97.70, is achieved by a learning-based method, DMRZ [9]. Two learning-free methods, our method and IRISA [9], achieve close F-measure values. The precision and recall values indicate that the IRISA method splits baselines more precisely but also misses more baselines than our method. Finally, in Figure 13 we present the results obtained during the RASM2018 competition [53] for historical scientific manuscripts in Arabic. We submitted the output text lines, and the results were evaluated by the competition committee using a success rate defined in [53]. The results and the comparison with the other participating methods are presented in Table 7.
Figure 13. Example results from the RASM2018 dataset. Some errors occur due to non-textual elements such as the underlines in (a). Some errors occur because satellite textual elements are grouped separately (b). Text lines with heterogeneous lengths (c). Touching characters are split, but not very adequately (d).

Conclusions
We presented a learning-free text line detection and extraction method for historical handwritten document images. Historical handwritten documents contain complex layouts with varying text line heights and lengths, touching characters, and crowded ascenders and descenders. Learning-based algorithms are an increasingly popular trend in text line segmentation of historical handwritten documents; however, they require labeling effort for training. The proposed method detects fairly well the blob lines that strike through text lines of arbitrary heights and lengths by convolving the input image with the second derivative of anisotropic Gaussians using automatic scale selection. In addition, the EM formulation removes spurious blob lines and assigns the connected components to the closest detected blob line or to the label of the closest component. The ablation study shows that the method is not sensitive to the parameters, except for the character height estimation. Another limitation of the method is that it cannot deal with severely skewed text lines. This limitation can be overcome using multi-oriented and multiscale anisotropic Gaussians. The results on three different datasets are not always the state of the art, but they are achieved using the same experimental setting. In the future, we plan to develop an unsupervised machine learning method for text line segmentation, using the blob lines detected by the proposed method as an annotation of the text lines.