Handwriting-Based Text Line Segmentation from Malayalam Documents

Application: The proposed technique and the database


Introduction
The Malayalam language is spoken by the people of Kerala State in India. After the pandemic situation created by the novel coronavirus, it has become a necessity to encode the local language, conventionally written with pen and paper, in an electronic format. Optical character recognition (OCR) systems convert handwritten documents to a computer-editable digital form. This is highly beneficial for individuals sharing documents among different offices and banks, and for teachers' notes when conducting online classes in the local language. Malayalam is a language with a rich character set and it is written from left to right. Some of the vowel diacritics in the language extend above or below the full length of a normal alphabet. The language contains compound letters formed of two different characters. The characters have a looped and curved nature. Some characters in the language, like 'Chandrakkala', are written with a space above another character. Because of the large variations in handwriting styles, large gaps are created above or below the characters where such letters are written. Some of these possible cases are illustrated in Figure 1. All these factors make the recognition of handwritten documents in this language a complex problem [1,2]. Therefore, the development of OCR for the recognition of unconstrained handwritten document images in the Malayalam language has not progressed much yet. The work presented in this paper extracts the text lines from handwritten documents by considering different handwriting styles in the language. Most of the existing research concerns the recognition of isolated handwritten Malayalam characters. As far as unconstrained handwritten documents are considered, the recognition of individual characters can be performed in a hierarchical way. First, text lines are extracted; then, words are segmented from each line; after this, characters are segmented from each word. These characters are recognized and converted into digital form. Using this digital representation, the
handwritten characters are encoded into a printable text format. The success of this entire process is mainly dependent on the correctness of the text line extraction step. In this paper, a novel technique for the extraction of text lines from Malayalam handwritten documents is developed. The paper is organized as follows. Section 2 describes the related work in this area. Section 3 presents the proposed methods for text line extraction. The experimental setup, results and discussion are detailed in Section 4, while Section 5 concludes the work.

Related Works
Optical character recognition (OCR) converts PDF files, scanned documents, images containing text, and printed and handwritten documents into editable electronic documents. OCR can be implemented for printed character recognition and handwritten character recognition. The latter is further categorized into online and offline recognition systems. In online systems, real-time recognition of characters is performed while the user is writing. Offline recognition is performed on document images or PDFs. Therefore, it is a more complex problem than online recognition. OCR for languages like English and Chinese is available in handheld devices and personal computers [3,4]. In developing OCR for handwritten documents, text line segmentation is a major step and a challenging problem. Ref. [5] gives an overview of the different methods used for text line extraction, such as projection profiles, smearing methods, Hough transform-based methods and stochastic methods using the Viterbi algorithm. A. Khandelwal et al. proposed a method to extract text lines by considering the height of the connected components and neighborhood connectivity [6]. In [7], connected components in the image are partitioned into three subsets based on their height. Every connected component in a subset is bounded by an equally sized block. Text lines are extracted by applying a Hough transform to these blocks. Ref. [8] presents text line extraction from handwritten documents using natural learning techniques based on the Hough transform. The Hough transform is applied to the minima points of connected components in a small strip of the image. Minima points are then clustered by applying a moving window algorithm to extract the text lines. A. Souhar et al.
[9] performed text line segmentation using a watershed transform for Arabic documents. Text lines were linked by connected component analysis and the watershed transform was applied to detect the text lines. Another approach uses a baseline detected using a projection profile, and then a watershed transform is applied on the extracted path to segment the text lines.
Deep learning architectures like convolutional neural networks (CNNs) and generative adversarial networks (GANs), trained using annotated images, are used to extract the text lines in [10,11]. B. K. Barakat et al. [12] proposed an unsupervised deep learning technique for the extraction of text lines. It makes use of the difference in the number of foreground pixels in the text line region and in the space between the text lines. Ref. [13] presents a learning-free algorithm for text line segmentation. By convolving the input image with the second derivative of an anisotropic Gaussian filter, blob lines are detected, which strike through the text lines. Text lines are extracted by assigning connected components to the detected blob lines. Text line extraction from handwritten documents in Indian languages such as Oriya, Bangla and Kannada has been performed by applying a projection profile on the document image [14-16]. Works on the recognition of handwritten Malayalam characters have mainly focused on classifying isolated letters. Jomy John et al. [17] proposed a method to recognize the individual characters in Malayalam using the gradient and curvature features of handwritten characters. In [18], handwritten Malayalam character recognition is performed by determining the position and number of horizontal and vertical lines in the skeletonized character set. A chain code histogram from the chain code representation of the boundary of a skeletonized character image is used as a feature vector for character recognition in [19]. Ref. [20] presents the recognition of handwritten Malayalam vowel characters using a Hidden Markov Model toolkit. P. Jino et al.
proposed stacked long short-term memory (LSTM) neural networks for the recognition of Malayalam letters [21]. In [22], the count of the zero-crossings of the wavelet coefficients is used as a feature to classify characters. A 1D wavelet transform is applied on the horizontal and vertical projections of the character image and the transform coefficients are considered as the feature vectors for classification in [23]. A benchmark database for isolated Malayalam handwritten characters was created in the work published in [24], which also showcased the alphabet of the Malayalam language. Malayalam OCR for the recognition of printed characters is available online [25,26]. The software e-Aksharayan converts printed documents in seven Indian languages, including Malayalam, into an editable text form [27]. C. Shanjana et al. [28] proposed a technique for the recognition of characters from unconstrained Malayalam handwritten documents. They segmented text lines, words and characters in the handwritten document image using horizontal and vertical projection.
A novel method is proposed in this paper to extract the text lines from Malayalam handwritten documents. The proposed method is detailed in the next section.

Proposed Method
The different steps involved in the newly developed technique for the extraction of text lines are shown in Figure 2. The main steps in the proposed method can be summarized as preprocessing, the detection and segmentation of overlapping lines, and the detection of incorrectly segmented short lines and their joining to the correct lines. These steps are explained in the following sections.

Preprocessing
The handwritten document is first scanned and a color image in PNG format is obtained. It is converted into a grayscale image, which is binarized using Otsu's global thresholding algorithm [29]. In MATLAB, the function graythresh() returns an image-dependent threshold value, calculated using Otsu's global thresholding algorithm. This threshold is used by the function imbinarize() to convert the grayscale image into a binary image with black characters on a white background. The binarized image is then inverted so that the image can be subjected to horizontal and vertical projection to segment the text lines. The preprocessed image thus obtained is named I_b. The image I_b is divided into three vertical stripes of equal width. A morphological operation is performed on each vertical stripe to eliminate isolated pixels, i.e., a 1 surrounded by 0s. Following this, a morphological bridge operation is performed on all the vertical stripes to connect or bridge pixels that are not connected. A summary of the overall process of text line extraction after the preprocessing step is given below.
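The preprocessing chain above can be sketched in code. The paper works in MATLAB (graythresh(), imbinarize()); the following Python/NumPy sketch is only an equivalent illustration, with function names of my own choosing (otsu_threshold, preprocess), and it omits the morphological clean and bridge operations.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's global threshold for a uint8 grayscale image (0-255).
    MATLAB's graythresh() returns a normalized value; here the raw
    gray level maximizing the between-class variance is returned."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    cum = np.cumsum(prob)                        # class-0 probability up to t
    cum_mean = np.cumsum(prob * np.arange(256))
    mean_total = cum_mean[-1]
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0, w1 = cum[t], 1.0 - cum[t]
        if w0 == 0.0 or w1 == 0.0:
            continue
        m0 = cum_mean[t] / w0
        m1 = (mean_total - cum_mean[t]) / w1
        between_var = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t

def preprocess(gray):
    """Binarize with Otsu, invert so ink pixels become 1 (background 0),
    and divide the image I_b into three vertical stripes of equal width."""
    t = otsu_threshold(gray)
    Ib = (gray <= t).astype(np.uint8)   # inversion: dark ink -> foreground
    w = Ib.shape[1] // 3
    stripes = [Ib[:, :w], Ib[:, w:2 * w], Ib[:, 2 * w:]]
    return Ib, stripes
```

With a synthetic page (light background, one dark band of ink), preprocess() returns the inverted binary image and its three stripes.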

Text lines are extracted in each vertical stripe separately using the horizontal projection (HP) method described in [28]. The text lines extracted from each vertical stripe may be correctly or incorrectly segmented. Incorrect segmentation is due to the presence of overlapping lines, or due to some special characters such as 'Chandrakkala' in the Malayalam language and the wide separation of some letters written below a particular character. The latter case contributes to the incorrect segmentation of such characters into separate lines called short lines. All these possibilities are checked for and corrected by the proposed technique. As a first step, each extracted line in the vertical stripe is checked regarding whether it is an overlapped line. If not, it is checked for a short line. If a line is detected as an overlapped line, it is further divided into individual lines. The individual lines obtained after segmenting the overlapped lines are checked for an incorrectly segmented short line. If detected as a short line, it is joined to the correct line. The line segments
obtained in the three vertical stripes after all these procedures are joined with the corresponding line segments in the other vertical stripes. These methods are developed with the assumption that none of the lines begin as a paragraph with a large indentation from the margin. Moreover, it is assumed that, except for the last line on the page, no line terminates before the end of a line on the page. These assumptions ensure that all three vertical stripes contain parts of every line except the last line. A detailed description of the developed techniques to detect and process overlapping lines and short lines is given in the following subsections.
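The per-stripe horizontal projection (HP) segmentation can be illustrated with a minimal sketch. The exact method of [28] is not reproduced in this paper's text; this generic version simply treats each maximal run of rows with a non-zero projection as a text line, which is a simplifying assumption of mine.

```python
import numpy as np

def extract_lines_hp(stripe):
    """Segment a binary stripe (foreground = 1) into text-line row ranges
    using the horizontal projection: rows whose projection is zero act as
    separators between consecutive lines."""
    hp = stripe.sum(axis=1)          # horizontal projection, one value per row
    lines, start = [], None
    for r, v in enumerate(hp):
        if v > 0 and start is None:
            start = r                # a line begins at the first ink row
        elif v == 0 and start is not None:
            lines.append((start, r - 1))
            start = None
    if start is not None:            # line touching the bottom of the stripe
        lines.append((start, len(hp) - 1))
    return lines
```

For a stripe with ink in rows 2-4 and 10-13, this returns the two row ranges of those lines.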

1. Find the average value of the height of the characters (Avg_cht) in the preprocessed handwritten image, I_b. This is obtained by finding the average height of the connected components in I_b.

2. Identify the region containing the lines in each vertical stripe. Then, find the median value of the number of rows in each region, which indicates the line height, L_ht. The obtained value is the median value of the line heights, M_Lht, in a vertical stripe.

3. The threshold value, T_ov, for identifying the overlapping lines is calculated based on the values obtained from step 1 and step 2. If (M_Lht > 4 × Avg_cht), T_ov is computed using Equation (1); otherwise, Equation (2) is used.

4. Compare the height of each line with the threshold value, T_ov.

5. If the height of the line segment is above or equal to T_ov, it is detected as an overlapped line.

6. If a line is detected as an overlapped line segment, then the number of lines, L_cnt, in the overlapped line is calculated using Equations (3)-(6), depending on conditions on M_Lht; for example, the cases (M_Lht > 5 × Avg_cht) and (M_Lht > 3 × Avg_cht and M_Lht ≤ 4 × Avg_cht) are treated separately.

The size of the characters written may vary from person to person. This is considered while framing Equations (1)-(6) to compute the threshold value T_ov and the number of lines in the overlapped segment, L_cnt, in the proposed technique. Once a line segment is identified as having overlapping lines, it has to be segmented into individual lines. The steps involved in segmenting the overlapping lines are shown in Figure 3 and a detailed description is given in the next subsection.
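The detection steps above can be sketched as follows. The right-hand sides of Equations (1)-(6) are not reproduced in the text, so the threshold factors below (k_tall, k_short) are illustrative placeholders of my own, not the paper's values.

```python
import numpy as np

def median_line_height(line_ranges):
    """M_Lht: median of the line heights (rows per segmented region)
    in a vertical stripe; line_ranges is a list of (start, end) rows."""
    heights = [e - s + 1 for s, e in line_ranges]
    return float(np.median(heights)) if heights else 0.0

def overlap_threshold(M_Lht, Avg_cht, k_tall=2.0, k_short=2.5):
    """T_ov per Equations (1)/(2): the branch condition (M_Lht > 4*Avg_cht)
    is from the text, but the factors k_tall/k_short are placeholders,
    since the equations' right-hand sides are not given."""
    if M_Lht > 4 * Avg_cht:
        return k_tall * M_Lht
    return k_short * Avg_cht

def is_overlapped(line_height, T_ov):
    """Steps 4-5: a segment at least T_ov rows tall is flagged overlapped."""
    return line_height >= T_ov
```

A segment of height 120 with T_ov = 100 would be flagged, while one of height 80 would not.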

Separation of Overlapping Lines
A novel technique is developed to separate overlapping lines present in the text line segment extracted from the vertical stripes. The number of times that the method is applied for the separation of these overlapping lines depends on the number of overlapping lines in the extracted text line, which is obtained using Equations (3)-(6). The technique then determines the region, R_ov, where the initial segmentation is to be carried out. Let r_1 and r_2 be the rows between which the overlapping of lines occurs; these are determined by Equations (7)-(10).

The reason for overlapping lines is the presence of ascenders and descenders in the language. The ascenders and descenders will be present in the region between r_1 and r_2. The parts of the characters that are present in this region must be identified as ascenders or descenders, and the overlapping lines have to be segmented accordingly. Each line in the region containing the overlapping lines is segmented out one by one.
For this, the beginning and end of each character in the region R_ov are found using the horizontal projection method. The beginning of a character is the first non-zero horizontal projection (HP) value that exists between r_1 and r_2. Similarly, the last non-zero HP value is the end of the character in the vertical direction. These positions are named ch_o_beg and ch_o_end. This region is specified as the character region in the flowchart shown in Figure 3. In order to identify the presence of ascenders and descenders in this region, the row numbers of the largest two HP values between ch_o_beg and ch_o_end are found. Let the row containing the highest HP value be rh_p1 and that containing the second highest HP value be rh_p2. Now, the text line region containing the overlapping lines is divided into three parts: the upper region (U_ov), the lower region (L_ov) and the character region. The region above the character region is termed the upper region, U_ov, and that below the character region is termed the lower region, denoted as L_ov. The portion of the characters located in the region of overlapping lines is associated with the appropriate line as per the following rule.
Thus, the upper region and lower region are updated. The rows between rh_p1 and rh_p2 are divided columnwise to separate the portion of the characters present in this region. This portion contains the ascenders and descenders that are responsible for the overlapping. The columnwise segmentation is performed using the vertical projection (VP) method. For each such segmented character, if the HP value in the first row is greater than that in the last row, the character is joined to the lower region, considering it as an ascender. Otherwise, it is considered as a descender and joined to the upper region.
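The columnwise VP split and the ascender/descender rule above can be sketched as follows; this is a minimal illustration operating on the extracted character region, with function names of my own.

```python
import numpy as np

def split_char_region(region):
    """Columnwise split of the overlap region using the vertical projection:
    columns with zero foreground separate the character fragments."""
    vp = region.sum(axis=0)
    segs, start = [], None
    for c, v in enumerate(vp):
        if v > 0 and start is None:
            start = c
        elif v == 0 and start is not None:
            segs.append((start, c - 1))
            start = None
    if start is not None:
        segs.append((start, len(vp) - 1))
    return segs

def assign_fragments(region):
    """For each fragment, compare the HP value of its first and last rows:
    first > last -> an ascender, joined to the lower line ('lower');
    otherwise -> a descender, joined to the upper line ('upper')."""
    labels = []
    for c0, c1 in split_char_region(region):
        frag = region[:, c0:c1 + 1]
        hp = frag.sum(axis=1)
        labels.append('lower' if hp[0] > hp[-1] else 'upper')
    return labels
```

A fragment whose ink touches the top row of the region is treated as an ascender from the line below; one touching the bottom row as a descender from the line above.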
After segmenting the overlapped lines, it is checked whether they are short lines. The technique developed to detect the short lines is described in the next subsection.

Detection of Incorrectly Segmented Short Lines
As discussed in Section 3.1, short lines are the result of incorrect segmentation due to the spaces created by characters written above or below a letter. Depending on the handwriting, this spacing will vary. Short lines contain sparsely scattered fragments of characters and therefore their height will be smaller. Hence, the threshold to detect short lines depends on the height of the line, the width of the characters present in the line and the number of non-zero vertical projection values. Because of this dependency, thresholds for detecting short lines are developed and are given in Equations (11)-(14).

1. Determine the threshold value t_sh1 using the equation where Avg_chwid is the average character width in a document image.

2. A second threshold value, t_sh2, is determined as follows.

3. Compute the number of non-zero vertical projection values, VPN, in the given text line segmented from the handwritten document. This is determined to check the sparsity of the character fragments present in the short line.

4. A short line is detected if (VPN < 2 × t_sh1 and VPN > t_sh2), or if (VPN < 9 × t_sh1 and VPN > t_sh2 and L_ht < 0.5 × M_Lht).

Depending on the handwriting, the values of Avg_cht, Avg_chwid and M_Lht may vary from one document to another.
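The VPN computation and the detection rule in step 4 can be sketched as follows. Equations (11)-(14) for t_sh1 and t_sh2 are not reproduced in the text, so the thresholds are passed in as parameters; the comparison operators garbled out of the second condition are assumed to mirror the first, which is an assumption of mine.

```python
import numpy as np

def vpn(line_img):
    """VPN: count of non-zero vertical-projection values in a binary
    line segment (foreground = 1)."""
    return int(np.count_nonzero(line_img.sum(axis=0)))

def is_short_line(VPN, t_sh1, t_sh2, L_ht, M_Lht):
    """Step 4 of short-line detection. t_sh1/t_sh2 come from Equations
    (11)-(14), which are not reproduced here, so they are inputs; the
    '<' and '>' in the second condition are assumed, mirroring the first."""
    if VPN < 2 * t_sh1 and VPN > t_sh2:
        return True
    if VPN < 9 * t_sh1 and VPN > t_sh2 and L_ht < 0.5 * M_Lht:
        return True
    return False
```

For example, with t_sh1 = 6 and t_sh2 = 4, a segment with VPN = 10 satisfies the first condition and is flagged as a short line.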

Joining of Incorrectly Segmented Line to the Correct Line
If the detected short line is the first line in a vertical stripe, we join it with the next line. If it is the last line, we join it with the previous line. If the line is between the first and last lines, we find the position at which the HP value is highest in the line. If the position is after the middle of the line, then the line is considered incorrectly segmented due to the character 'Chandrakkala'. Therefore, it is joined to the next line; otherwise, it is joined to the previous line.
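The joining rule above can be sketched as a small decision function; the function name and the 'next'/'previous' labels are my own illustration.

```python
import numpy as np

def join_direction(idx, n_lines, hp):
    """Decide whether a detected short line joins the 'next' or 'previous'
    line. idx is the line's position in the stripe (0-based), n_lines the
    number of lines in the stripe, hp the short line's horizontal projection."""
    if idx == 0:
        return 'next'              # first line in the stripe
    if idx == n_lines - 1:
        return 'previous'          # last line in the stripe
    peak = int(np.argmax(hp))      # row of the highest HP value
    # A peak after the middle suggests a 'Chandrakkala'-type fragment
    # belonging with the line below.
    return 'next' if peak > len(hp) / 2 else 'previous'
```

So a middle short line whose HP peak lies in its lower half is merged downward, otherwise upward.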

Results and Discussion
The proposed techniques are implemented using the software MATLAB R2022b. A new database of Malayalam handwritten documents, LIPI, is created as an initial step in this research work. All the techniques proposed in this paper are validated using images taken from this database. As discussed in Section 1, the proposed methods are developed by considering the specific characteristics of the Malayalam alphabet and different handwriting styles. Therefore, publicly available databases of other languages cannot be used to test the proposed method. A brief description of the newly developed database, LIPI, is given in the next subsection.

Database for Malayalam Handwritten Documents
The database is created by collecting Malayalam handwritten documents from professionals and from undergraduate and postgraduate students, in the age group between 18 and 45. The articles for the manuscripts are collected from leading Malayalam newspapers and textbooks and cover all the alphabets, consonant diacritics and conjunct consonants in the Malayalam language. Initially, a manuscript of the article is written on A4 size paper without any constraints on the pen or the script of the Malayalam language. At present, people write using a combination of the old Malayalam script and the new Malayalam script. In total, 402 handwritten documents collected from 200 people are scanned using an Epson L310 flatbed scanner at 200 dpi. The time taken to scan one document is approximately 29 s. A faithful representation of the images is obtained. One observation is that, in 1% of the scanned images, some straight lines are seen at the bottom of the paper where text is not present. These lines are automatically removed during the processing of the vertical stripes to eliminate short lines. All images are scanned at the same rate. No prominent errors are obtained, regardless of the handwriting style of the author. Text that is written at the right end of the paper without a margin may be lost as it crosses the scanning boundary.
The scanned image is in PNG format and its size is 2338 × 1654 pixels. An overall description of the images in the database is given in Table 1. The ground truth images for all the lines in the documents are created manually using the freehand crop tool in MATLAB.
A sample image from the newly developed LIPI database is shown in Figure 4 and the image after binarization is shown in Figure 5. The ground truth images created for text lines 1 and 5 in the image of Figure 5 are shown in Figure 6a,b, respectively. Ground truth images are created for each of the 7535 text lines in the handwritten document images. As shown in Figure 6a,b, the exact position of the text lines in the document images is retained in the ground truth images.

Implementation Results
The image obtained from the scanner is a color image, as in Figure 4. It is converted to grayscale and then to binary. The binary image is inverted to make the background black and the foreground (handwritten characters) white. The binary image obtained after such processing is shown in Figure 5. The binary image is divided into three vertical stripes so as to have almost straight lines, as illustrated in Figure 7. In the proposed method, the number of vertical stripes is fixed at three. If the page is not divided into vertical stripes, then the text lines will be slanting. A similar case arises when the number of vertical stripes is 2. In both these cases, text line extraction based on the horizontal projection method will not yield good accuracy. If the number of vertical stripes is increased above three, the width of each vertical stripe decreases accordingly. Then, there is a possibility of completely missing the portion of a text line present in these vertical stripes, which is against the assumptions based on which the algorithms are designed.

The text lines extracted from the vertical stripes using the horizontal projection (HP) method consist of overlapping and short lines, as shown in Figure 8. Multiple lines or overlapping lines are present in vertical stripe 1 (VS_1) and vertical stripe 2 (VS_2) in lines 12 and 13, respectively. The average height of the characters for a document in the database images is between 11 and 23 and is dependent on the handwriting of the different authors. Based on the average height of the characters, the threshold T_ov, computed using Equations (1) and (2) to detect overlapping lines, is in the range of 91.5 to 140.65 for all the images in the database. This is because of the size variations in the characters written by different authors.

To demonstrate the segmentation of overlapping lines using the proposed technique, such lines in VS_2 in Figure 8 are shown separately in Figure 9. It is observed that the HP value is non-zero between the overlapping lines because of the ascenders and descenders in the language. To segment the overlapping lines, a region is found using the proposed method; it is shown using dotted lines in Figure 9 for a better understanding. The overlapping lines are segmented correctly by applying the proposed techniques discussed in Sections 3.2 and 3.3, and the result is shown in Figure 10. For the overlapping lines shown in Figure 9, of size 162 × 552, the region for the segmentation of the lines is found by computing the values r_1 and r_2 using Equations (7)-(10) and is obtained as 63.5 and 72.5, respectively. r_1 and r_2 are marked manually using dotted lines in Figure 9.

Rows 64 to 72 represent the character region, as discussed in Section 3.3. Therefore, rows 1 to 63 are the upper region, and rows 73 to 162 are the lower region. From the database of 402 handwritten document images, 629 overlapping lines are detected, of which 441 are segmented correctly. The overlapping lines in VS_1 and VS_2 in Figure 8 are segmented correctly and the result of the segmentation is depicted in Figure 11.
Row 64 to 72 represent the character region, as discussed in Section 3.3.Therefore, rows 1 to 6 are the upper region, and rows 73 to 162 are the lower region.From the database of 40 handwritten document images, 629 overlapping lines are detected, of which 441 are seg mented correctly.The overlapping lines in  and  in Figure 8 are segmented cor rectly and the result of the segmentation is depicted in Figure 11.The remaining short lines in Figure 11 are addressed using the proposed methods discussed in Sections 3.4 and 3.5.The height of the short lines is very small and the characters present in such lines can be as small as a single dot.The short lines are joined perfectly to the appropriate lines, as shown in Figure 12.The thresholds  and  , computed according to the proposed method in Section 3.4 to detect short lines, are in the range of 19 to 39.9 and 2.7145 to 13.3, respectively.These two thresholds are dependent on the average character width per page, which lies between 10 and 22 for the document images in the database.While using the HP method, the gap between a Malayalam letter and a character such as 'Chandrakkala' placed above it, or as in compound characters where a letter is placed below another letter, results in the segmentation of such characters into separate lines called short lines.Since both these cases are not present in Figure 11, the possibility of 'Chandrakkala' , a compound character segmented as a short line, is given in Figures 13-16.In Figure 13, 'Chandrakkala' is incorrectly segmented due to the small gap between the characters above which it is written, and this gap will vary depending on individual handwriting styles.The proposed techniques are able to exactly join 'Chandrakkala' to the correct line, which is shown in Figure 14.While writing a compound character with one letter below another letter, a space is created due to the writing style.Due to this gap, the letter written below is segmented as a short line, which is illustrated 
in Figure 15.The proposed techniques perfectly rejoin the letter, which is depicted in Figure 16.From the images in the LIPI database, 2607 short lines are detected, out of which 2577 short lines are joined perfectly to the appropriate lines.The remaining short lines in Figure 11 are addressed using the proposed methods discussed in Sections 3.4 and 3.5.The height of the short lines is very small and the characters present in such lines can be as small as a single dot.The short lines are joined perfectly to the appropriate lines, as shown in Figure 12.The thresholds t sh 1 and t sh 2 , computed according to the proposed method in Section 3.4 to detect short lines, are in the range of 19 to 39.9 and 2.7145 to 13.3, respectively.These two thresholds are dependent on the average character width per page, which lies between 10 and 22 for the document images in the database.While using the HP method, the gap between a Malayalam letter and a character such as 'Chandrakkala' placed above it, or as in compound characters where a letter is placed below another letter, results in the segmentation of such characters into separate lines called short lines.Since both these cases are not present in Figure 11, the possibility of 'Chandrakkala', a compound character segmented as a short line, is given in Figures 13-16.In Figure 13, 'Chandrakkala' is incorrectly segmented due to the small gap between the characters above which it is written, and this gap will vary depending on individual handwriting styles.The proposed techniques are able to exactly join 'Chandrakkala' to the correct line, which is shown in Figure 14.While writing a compound character with one letter below another letter, a space is created due to the writing style.Due to this gap, the letter written below is segmented as a short line, which is illustrated in Figure 15.The proposed techniques perfectly rejoin the letter, which is depicted in Figure 16.From the images in the LIPI database, 2607 short lines are 
detected, out of which 2577 short lines are joined perfectly to the appropriate lines.The remaining short lines in Figure 11 are addressed using the proposed methods discussed in Sections 3.4 and 3.5.The height of the short lines is very small and the characters present in such lines can be as small as a single dot.The short lines are joined perfectly to the appropriate lines, as shown in Figure 12.The thresholds  and  , computed according to the proposed method in Section 3.4 to detect short lines, are in the range of 19 to 39.9 and 2.7145 to 13.3, respectively.These two thresholds are dependent on the average character width per page, which lies between 10 and 22 for the document images in the database.While using the HP method, the gap between a Malayalam letter and a character such as 'Chandrakkala' placed above it, or as in compound characters where a letter is placed below another letter, results in the segmentation of such characters into separate lines called short lines.Since both these cases are not present in Figure 11, the possibility of 'Chandrakkala' , a compound character segmented as a short line, is given in Figures 13-16.In Figure 13, 'Chandrakkala' is incorrectly segmented due to the small gap between the characters above which it is written, and this gap will vary depending on individual handwriting styles.The proposed techniques are able to exactly join 'Chandrakkala' to the correct line, which is shown in Figure 14.While writing a compound character with one letter below another letter, a space is created due to the writing style.Due to this gap, the letter written below is segmented as a short line, which is illustrated in Figure 15.The proposed techniques perfectly rejoin the letter, which is depicted in Figure 16.From the images in the LIPI database, 2607 short lines are detected, out of which 2577 short lines are joined perfectly to the appropriate lines.    
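The segmentation described above is driven by the horizontal projection (HP) profile of a binarized stripe. The following Python sketch illustrates the idea, not the paper's exact procedure: the HP profile is the per-row count of ON pixels, and an overlapping block is split at the region bounded by r1 and r2 (here passed in directly; the computation of r1 and r2 via Equations (7)-(10) is not reproduced). The array sizes mirror the 162 × 552 example of Figure 9, but the toy content is hypothetical.

```python
import numpy as np

def horizontal_projection(binary_img: np.ndarray) -> np.ndarray:
    """HP profile: count of ON (text) pixels in each row of a binarized image."""
    return binary_img.sum(axis=1)

def split_overlapping_block(block: np.ndarray, r1: int, r2: int):
    """Split an overlapping-line block at the region bounded by rows r1..r2.

    Rows above r1 form the upper line; the character region r1..r2 and the
    rows below it go to the lower line here. (The paper's Section 3.3 rule
    for assigning the character region may differ; this is a simplification.)
    """
    upper = block[:r1, :]
    lower = block[r1:, :]
    return upper, lower

# Toy 162-row block, as in the Figure 9 example, with the segmentation
# region found at rows 64-72 (r1 = 63.5, r2 = 72.5, rounded down).
block = np.zeros((162, 552), dtype=np.uint8)
block[10:60, 50:500] = 1    # hypothetical upper text line
block[80:150, 50:500] = 1   # hypothetical lower text line
upper, lower = split_overlapping_block(block, r1=63, r2=72)
```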
To obtain the complete line, the lines extracted in the three vertical stripes are joined together with the corresponding lines. The result of complete text line extraction is shown in Figure 17.
When a text line is extracted from a vertical stripe, the upper and lower positions of the line in the vertical stripe are stored in an array. When short lines are detected and joined to the correct lines, the position of the newly formed line is stored and that of the short line is deleted; the total number of lines extracted from the vertical stripe is reduced accordingly. Similarly, for overlapping lines, the positions of the newly formed lines after segmentation are updated and the total number of lines in the vertical stripe is increased accordingly. After line extraction is completed separately in each vertical stripe, each line part has to be joined to the corresponding parts of that line in the other two vertical stripes.
One of the challenges encountered is that the parts of a line extracted from the three vertical stripes may not have the same height. To overcome this, a template of black pixels with the same size as the binarized document image is created. Then, a text line from vertical stripe 1 is placed at the position from which it was extracted. Similarly, the corresponding parts of this text line from vertical stripe 2 and vertical stripe 3 are placed, and a complete line is formed. As can be observed, the extracted text line appears identical to the text line in the inverted binarized image. Then, the text line formed is stripped off from the template, and the output is shown in Figure 17.
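The template-based joining described above can be sketched as follows. This is an illustrative reconstruction, assuming each stripe records the top-left position of every extracted segment; the function name and data layout are ours, not the paper's.

```python
import numpy as np

def assemble_line(doc_shape, stripe_segments):
    """Rebuild a complete text line from its parts in the vertical stripes.

    A black (all-zero) template the size of the binarized document is
    created, and each stripe's segment is pasted back at the position it
    was extracted from, so parts of differing heights still align.
    stripe_segments: list of (row_top, col_left, segment_array) triples.
    """
    template = np.zeros(doc_shape, dtype=np.uint8)
    for row_top, col_left, seg in stripe_segments:
        h, w = seg.shape
        template[row_top:row_top + h, col_left:col_left + w] = seg
    # Strip the line off the template: keep only the rows containing text.
    rows = np.flatnonzero(template.any(axis=1))
    return template[rows.min():rows.max() + 1, :]

# Hypothetical example: one line split across three 100-column stripes,
# extracted with slightly different top rows and heights per stripe.
segs = [(20, 0, np.ones((10, 100), np.uint8)),
        (18, 100, np.ones((14, 100), np.uint8)),
        (22, 200, np.ones((9, 100), np.uint8))]
line = assemble_line((300, 300), segs)
```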
The process of joining the text lines extracted from the three vertical stripes is performed successfully, exactly matching the text line in the binarized image. An error occurs only if the text lines are not extracted correctly from any of the three vertical stripes.

Analysis of Word Area and Text Line Density in Malayalam Handwritten Documents
The self-similarity and complexity of different shapes in space can be quantitatively expressed using the Minkowski Dimension [30]. To calculate it, consider a grid of boxes covering an object in space, count the number of boxes that cover the object, and repeat the count using boxes scaled to different sizes. The Minkowski Dimension is calculated using Equation (15):

Minkowski Dimension, MD = -log n(s) / log(s)   (15)

where n(s) is the number of boxes with box size s.
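Equation (15) is estimated in practice by box counting over several grid scales and fitting the slope of log n(s) against log(1/s). The following is a minimal sketch; the scale choices and padding strategy are our assumptions, not the paper's.

```python
import numpy as np

def minkowski_dimension(binary_img: np.ndarray, sizes=(2, 4, 8, 16)) -> float:
    """Estimate the Minkowski (box-counting) dimension of a binary image.

    For each box size s, count the grid boxes containing at least one ON
    pixel, then fit the slope of log n(s) against log(1/s).
    """
    counts = []
    for s in sizes:
        h, w = binary_img.shape
        # Pad so the s-by-s grid tiles the image exactly.
        img = np.pad(binary_img, ((0, -h % s), (0, -w % s)))
        blocks = img.reshape(img.shape[0] // s, s, img.shape[1] // s, s)
        counts.append(blocks.any(axis=(1, 3)).sum())
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope

# Sanity check: a completely filled square has dimension 2.
square = np.ones((64, 64), dtype=np.uint8)
md = minkowski_dimension(square)   # ≈ 2.0
```

A densely written word gives a value closer to 2, a sparse one closer to 1 or below, matching the interpretation in the text.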
The Minkowski Dimension of words in ten Malayalam handwritten images, written by ten different authors with the same content, is shown in Figure 18. The possible range of values of the Minkowski Dimension, MD, is between 0 and 2 for objects in two-dimensional space. The Minkowski Dimension gives a measure of the area occupied by a word, or how densely the characters fill the word. A value between 0 and 1 indicates that the characters fill the word space less densely, i.e., there are more void spaces in the word, and that the spacing between the characters in the word is not uniform. A value between 1 and 2 indicates that the characters are placed more densely in the word and that the spacing between the characters inside the word is uniform. It gives insight into the regularity of the strokes, spacing, loops, curves and connections within the word, so the Minkowski Dimension can be used to analyze the handwriting styles of different authors through the fractal properties of the handwritten text. The values of the Minkowski Dimension displayed in Figure 18 range between 0.4524 and 1.1581.
The Minkowski Dimension is very useful in document analysis, as it gives insight into the complexity of characters, words and text lines.

Text line density is a measure that is useful in the segmentation of text lines. It is found using Equation (16). The accuracy of the detection and joining of short lines, the detection and segmentation of overlapping lines and the overall text line extraction are given in Table 2. The performance of the text line extraction process is quantitatively evaluated using the metrics discussed in the next subsection.

Type of Text Line | No. of Lines | No. of Correctly Segmented Lines | Accuracy

Performance Evaluation
The performance of the newly developed text line extraction method is quantitatively evaluated using standard metrics such as the MatchScore, Detection Rate (DR), Recognition Accuracy (RA) and F-measure (FM) [31].

MatchScore
The MatchScore value gives a quantitative measure of the number of ON pixels in the ground truth line matching the ON pixels in the detected line. A MatchScore table is constructed by computing the MatchScore value of each detected text line against all ground truth text lines of the corresponding document. The MatchScore value between the ith detected line and the jth ground truth line is calculated using Equation (17):

MatchScore(i, j) = C(G_j ∩ D_i ∩ I_b) / C((G_j ∪ D_i) ∩ I_b)   (17)

where C(X) gives the number of elements in set X, I_b is the preprocessed (binary) image from which the text lines are extracted, G_j is the set of pixels in the jth ground truth line and D_i is the set of pixels in the ith detected line. The range of MatchScore values is between 0 and 1, where 1 indicates the best match. A detected (extracted) text line is considered correct if there exists a one-to-one match between the detected text line and a ground truth line and the MatchScore value is greater than a threshold value, T_ms. The threshold chosen is 0.9999, in line with the existing works in [31,32]. It is observed that if the MatchScore is greater than 0.9999, the detected text line and the ground truth line are perfectly matched. The total number of ground truth (GT) lines is 7535, and 6482 lines are detected from the 402 document images. The detected lines having a one-to-one match with the ground truth lines, with a MatchScore value greater than 0.9999, amount to 6443. This shows that 85.5% of the ground truth text lines are perfectly matched with a correctly detected line in the documents considered for simulation. The number of ground truth (GT) and detected (D) lines and the percentage of detected lines with a MatchScore higher than T_ms are depicted in Table 2.
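Interpreting Equation (17) with pixel-coordinate sets, the MatchScore can be sketched as below. The exact intersection-over-union form is our reading of the standard segmentation-contest metric, since the equation body is not reproduced here, and the toy pixel sets are hypothetical.

```python
def match_score(gt_pixels: set, det_pixels: set, img_pixels: set) -> float:
    """MatchScore between a ground truth line G_j and a detected line D_i:
    ON pixels common to both, divided by the ON pixels in their union,
    both restricted to the binarized image I_b (Equation (17))."""
    inter = len(gt_pixels & det_pixels & img_pixels)
    union = len((gt_pixels | det_pixels) & img_pixels)
    return inter / union if union else 0.0

# Toy example with (row, col) pixel coordinates.
img = {(r, c) for r in range(4) for c in range(10)}   # all ON pixels in I_b
gt  = {(r, c) for r in range(4) for c in range(10)}   # ground truth line
det = {(r, c) for r in range(4) for c in range(9)}    # detector missed a column
score = match_score(gt, det, img)   # 36 / 40 = 0.9, below T_ms = 0.9999
```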
4.4.2. Detection Rate (DR), Recognition Accuracy (RA) and F-Measure (FM)

The Detection Rate (DR) is the ratio of the number of correctly detected lines to the number of ground truth lines in the handwritten document image and is computed as in Equation (18):

Detection Rate, DR = N_c / N_g   (18)

where N_c is the number of correctly detected lines and N_g is the number of ground truth lines. The Detection Rate for the experiment is found to be 85.5%. The accuracy of the detected lines is calculated using the Recognition Accuracy metric as in Equation (19):

Recognition Accuracy, RA = N_c / N_d   (19)

where N_d is the number of detected lines. The implementation of the proposed method results in a Recognition Accuracy of 99.39%.
The harmonic mean of DR and RA is called the F-measure (FM) and is computed using Equation (20):

F-measure, FM = (2 × DR × RA) / (DR + RA)   (20)
The value of FM lies between 0 and 1. If FM is 1, both DR and RA are at their maximum; if DR or RA is zero, then FM is zero. For the LIPI database, the proposed method results in an F-measure of 91.92%. The observed values of DR, RA and FM for the experiment, in percentages, are given in Table 3. The proposed method is compared with language-independent text line extraction algorithms such as A* Path Planning [33] and the piecewise painting algorithm [34]. The accuracy of these text line extraction algorithms on the newly developed Malayalam handwritten document image database, LIPI, is displayed in Table 4. In total, 378 document images from the LIPI database are selected to perform the experiment. The proposed method extracts 4912 text lines perfectly out of 5599, whereas the A* Path Planning and piecewise painting algorithms extract 3258 and 1495 text lines, respectively. The proposed method thus shows a higher accuracy of 87.7%, compared to 58.19% and 26.7% for A* Path Planning and the piecewise painting algorithm on the LIPI database. From the experiments conducted, it is observed that the piecewise painting algorithm incorrectly segments ascenders and descenders in the Malayalam language, the character 'Chandrakkala' and overlapping lines; this is the reason for its low accuracy of 26.7%. An interesting observation is that the A* Path Planning algorithm segments the text lines containing 'Chandrakkala' and the ascenders and descenders almost correctly; however, it fails to segment some of the text lines, so its accuracy is only 58.19%. To perform the comparison, the database used is the newly created Malayalam handwritten document image database, LIPI. Since the proposed method is specific to the Malayalam language, it will not show good performance for databases of other languages. A database of Malayalam handwritten document images is not publicly available, so a comparison with another Malayalam database is also not possible. However, the proposed technique is expected to perform well even if new Malayalam handwritten document images are added to the dataset.
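Equations (18)-(20) can be checked directly against the reported counts (7535 ground truth lines, 6482 detected, 6443 correctly matched):

```python
def evaluation_metrics(n_correct: int, n_ground_truth: int, n_detected: int):
    """Detection Rate, Recognition Accuracy and F-measure, Equations (18)-(20)."""
    dr = n_correct / n_ground_truth          # Equation (18)
    ra = n_correct / n_detected              # Equation (19)
    fm = 2 * dr * ra / (dr + ra) if dr + ra else 0.0   # Equation (20)
    return dr, ra, fm

# Counts reported for the LIPI database experiment.
dr, ra, fm = evaluation_metrics(6443, 7535, 6482)
# dr ≈ 0.855, ra ≈ 0.994, fm ≈ 0.919
```

The computed values reproduce the reported 85.5% Detection Rate, 99.39% Recognition Accuracy and 91.92% F-measure.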
The Detection Rate, Recognition Accuracy and F-measure of the proposed method, A* Path Planning and the piecewise painting algorithm are plotted in Figures 20-22, respectively. The proposed method has the highest Detection Rate, of 87.7%, while this value is 58.19% and 26.7% for the A* Path Planning and piecewise painting algorithms, as given in Figure 20. The A* Path Planning and piecewise painting algorithms have low detection rates for the document images in the newly developed LIPI database because these algorithms fail to segment some of the text lines. Moreover, the Recognition Accuracy and F-measure values are the highest for the proposed method, as depicted in Figures 21 and 22, respectively. The Recognition Accuracy indicates how many of the detected text lines are perfectly matched with the ground truth lines. As given in Figure 21, the piecewise painting algorithm has a Recognition Accuracy of only 33.94%, because it fails to correctly segment characters such as 'Chandrakkala' and the ascenders and descenders in the Malayalam language. The Recognition Accuracy obtained for A* Path Planning is 86.35%, which indicates that this language-independent algorithm can segment the ascenders, descenders and 'Chandrakkala' effectively.
A new database of Malayalam handwritten document images has been created in this work; to the best of our knowledge, no such database is publicly available. The language-specific text line extraction algorithm proposed in this work segments 85.507% of the text lines perfectly from the document images. Moreover, a total of 7535 ground truth images are created for the text lines in the document images, which are used to evaluate the method proposed in this paper. From the comparisons performed, it is observed that the proposed technique outperforms the language-independent A* Path Planning and piecewise painting algorithms.

Conclusions
The diverse handwriting styles of individual writers make text line extraction from handwritten documents a difficult task. In this paper, a novel method based on the size variations of the written alphabet due to different handwriting styles is proposed to extract the text lines from handwritten Malayalam documents. Various thresholds are developed to perform text line extraction by measuring the average height and width of the written characters in a document image; these thresholds therefore vary dynamically with each handwriting style. In the proposed technique, horizontal projection (HP) values are used to identify the positions at which to perform text line extraction. The two main problems encountered while using horizontal projection values are line segments containing multiple overlapping lines, and the varying gaps, due to different handwriting styles, between two characters when one character is written below another, which cause such lines to be split into two separate lines. Both problems are addressed and solved effectively using the proposed method. Overall, 85.507% of the text lines extracted from the newly created LIPI database of Malayalam handwritten document images perfectly match the ground truth lines when evaluated using the MatchScore metric. Moreover, the technique proposed in this paper outperforms language-independent text line extraction algorithms such as A* Path Planning and the piecewise painting algorithm on the LIPI database. Due to the unavailability of a Malayalam handwritten document image database, a new database of 402 images is created and named LIPI. Another major contribution is the ground truth images created for the 7535 text lines in the document images. The proposed method is an initial step in digitizing Malayalam handwritten documents, which will be highly beneficial in enabling individuals to share handwritten documents in their local language.

Figure 1.
Figure 1. Sample of ascender, descender, Chandrakkala, and compound letters with varying gaps depending on handwriting in Malayalam.


Figure 2 .
Figure 2. Flow of the processes involved in text line extraction of Malayalam handwritten documents.


Figure 3 .
Figure 3. Segmentation of overlapping lines from each vertical stripe into individual lines.


Figure 4 .
Figure 4. Sample image of Malayalam handwritten document from LIPI database.

Figure 5 .
Figure 5. Sample image converted to binary.



Figure 6.
Figure 6. (a) Ground truth of the first text line in the image shown in Figure 5; (b) ground truth of the fifth text line in the image shown in Figure 5.

Figure 7 .
Figure 7. Binarized handwritten document image divided into three vertical stripes.


Figure 8.
Figure 8. Text lines extracted from vertical stripes.

Figure 9 .
Figure 9. Manually marked region of segmentation found using the proposed technique.

Figure 10.
Figure 10. Result of applying the proposed method to segment overlapping lines.


Figure 11. Result after segmenting the overlapping lines detected in the vertical stripes.

Figure 12. Results after joining short lines to the appropriate text lines in the vertical stripe.

Figure 14. Result of applying the proposed method to join a short line containing 'Chandrakkala' back to the original line.

Figure 15. Short line containing a character written below an alphabet.

Figure 16. Result of applying the proposed method to join a short line back to the original line.

Figure 17. Text lines extracted from the handwritten document image in Figure 4.

Figure 18 shows the values of the Minkowski dimension for words in ten document images with the same content, written by 10 different writers. The values displayed in Figure 18 range between 0.4524 and 1.1581.

Figure 18. Minkowski dimension for words in 10 handwritten Malayalam pages.
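The Minkowski dimension of a word image can be estimated by box counting: cover the binary word mask with boxes of decreasing size, count the occupied boxes at each scale, and fit the slope of log N(s) against log (1/s). The sketch below is a generic box-counting estimator, assuming the paper uses the standard definition; function and parameter names are illustrative:

```python
import numpy as np

def minkowski_dimension(mask, box_sizes=(2, 4, 8, 16, 32)):
    """Estimate the Minkowski (box-counting) dimension of a binary word image.

    mask: 2D boolean array, True where ink pixels are present.
    Counts occupied boxes at several scales and fits log N(s) vs log (1/s).
    """
    counts = []
    for s in box_sizes:
        h, w = mask.shape
        # Pad so the image divides evenly into s x s boxes
        H, W = -(-h // s) * s, -(-w // s) * s
        padded = np.zeros((H, W), dtype=bool)
        padded[:h, :w] = mask
        boxes = padded.reshape(H // s, s, W // s, s)
        counts.append(np.any(boxes, axis=(1, 3)).sum())
    sizes = np.asarray(box_sizes, dtype=float)
    # Slope of the log-log fit is the dimension estimate
    slope, _ = np.polyfit(np.log(1.0 / sizes), np.log(counts), 1)
    return slope

# Sanity check: a filled square region has dimension 2
square = np.ones((64, 64), dtype=bool)
print(round(minkowski_dimension(square), 2))  # 2.0
```

Thin, curved handwriting strokes occupy space less densely than a filled region, which is why the reported word-level values fall well below 2 (0.4524 to 1.1581).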
The text line density is defined as the ratio of the area occupied by text lines to the total page area (Equation (16)):

Text line density = Area of text lines in the document / Area of the document (16)

Text line density analysis gives more information about the authors, writing tools and structure of documents. The text line density of 15 Malayalam handwritten pages written by 15 authors is shown in Figure 19. It ranges between 14.06% and 29.7%, which indicates that the text lines are less densely packed, with larger spacing and large margins.
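Equation (16) reduces to a ratio of pixel areas once each extracted text line is available as a binary mask. A minimal sketch, assuming page-sized boolean masks per line (the function name and toy data are illustrative):

```python
import numpy as np

def text_line_density(line_masks, page_shape):
    """Equation (16): ratio of the area covered by text lines to the page area.

    line_masks: list of 2D boolean arrays, one per extracted text line,
    each the size of the page with True inside that line's region.
    Returns the density as a percentage, as reported in the text.
    """
    page_area = page_shape[0] * page_shape[1]
    text_area = sum(int(m.sum()) for m in line_masks)
    return 100.0 * text_area / page_area

# Toy page: two line regions on a 100 x 200 page
page_shape = (100, 200)
line1 = np.zeros(page_shape, dtype=bool); line1[10:25, 5:195] = True
line2 = np.zeros(page_shape, dtype=bool); line2[40:55, 5:195] = True
print(text_line_density([line1, line2], page_shape))  # 28.5
```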

Figure 19. Text line density in 15 Malayalam handwritten pages.

Figure 20. Detection Rate of proposed method, A* Path Planning and piecewise painting algorithm.


Figure 21. Recognition Accuracy of proposed method, A* Path Planning and piecewise painting algorithm.

Figure 22. F-measure of proposed method, A* Path Planning and piecewise painting algorithm.
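The three evaluation measures in Figures 20-22 are standard for text line segmentation: the detection rate (DR) is the fraction of ground-truth lines matched, the recognition accuracy (RA) is the fraction of detected lines matched, and the F-measure (FM) is their harmonic mean. The matching criterion itself (e.g. a pixel-overlap threshold) comes from the evaluation protocol and is not shown here; the sketch below only combines the counts:

```python
def segmentation_scores(matched, ground_truth_lines, detected_lines):
    """DR, RA and FM for text line segmentation (as in Figures 20-22).

    matched: number of one-to-one matches between detected and
    ground-truth lines under the chosen overlap criterion.
    """
    dr = matched / ground_truth_lines        # detection rate
    ra = matched / detected_lines            # recognition accuracy
    fm = 2 * dr * ra / (dr + ra) if dr + ra else 0.0  # harmonic mean
    return dr, ra, fm

dr, ra, fm = segmentation_scores(90, 100, 95)
print(round(dr, 3), round(ra, 3), round(fm, 3))  # 0.9 0.947 0.923
```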

If rh_p1 < rh_p2: join rows ch_o_beg to rh_p1 − 1 to the bottom of U_ov, and join rows rh_p2 + 1 to ch_o_end to the top of L_ov. Otherwise: join rows ch_o_beg to rh_p2 − 1 to the bottom of U_ov, and join rows rh_p1 + 1 to ch_o_end to the top of L_ov.
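In both branches of this rule, the rows of the overlapping component above the smaller reference row go to the upper line U_ov and the rows below the larger reference row go to the lower line L_ov. A minimal sketch of that assignment, assuming inclusive 1-based row ranges (function and variable names are illustrative, matching the symbols above):

```python
def split_overlapping(ch_o_beg, ch_o_end, rh_p1, rh_p2):
    """Assign the rows of an overlapping component to the upper (U_ov)
    and lower (L_ov) text lines, following the joining rule in the text.

    Returns (upper_rows, lower_rows) as inclusive (start, end) row ranges.
    """
    lo, hi = min(rh_p1, rh_p2), max(rh_p1, rh_p2)
    upper_rows = (ch_o_beg, lo - 1)  # joined to the bottom of U_ov
    lower_rows = (hi + 1, ch_o_end)  # joined to the top of L_ov
    return upper_rows, lower_rows

print(split_overlapping(60, 90, 70, 80))  # ((60, 69), (81, 90))
```

Writing the two branches with min/max makes it explicit that the result is the same whichever of rh_p1 and rh_p2 comes first.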

Table 1. Overview of the database.

Table 2. Accuracy of the proposed method in segmenting text lines, overlapping lines and short lines.

Table 3. Evaluation of text line segmentation using the detection rate (DR), recognition accuracy (RA) and F-measure (FM).

Table 4. Comparison of accuracy obtained for different existing text line extraction methods and the proposed method.