1. Introduction
Geological borehole data serve as an important record and a primary outcome of geological exploration, providing essential subsurface information that supports a wide range of geoscientific and engineering applications. They play a vital role in mineral resource prediction by supplying key geological parameters for identifying and evaluating resource potential [
1,
2]; in groundwater monitoring by offering reliable data for assessing aquifer conditions and dynamics [
3,
4]; and in geological hazard prevention and mitigation through informing assessments of potential risks such as landslides, subsidence, and seismic activity [
5,
6,
7]. Borehole data are also utilized in environmental monitoring to track subsurface environmental changes and contamination [
8], and in engineering construction to provide site-specific geological information essential for safe and efficient design and implementation of infrastructure projects [
9,
10]. With the widespread adoption of computer technology, some borehole data have transitioned from traditional paper-based records to digital formats such as Excel, Word, PDF, and MapGIS [
11]. However, due to historical constraints and data generation limitations, some borehole records still exist as paper-based images, exhibiting characteristics of multi-source heterogeneity and unstructured data [
12]. To efficiently process and analyze such data, they must first be converted into borehole log images, followed by vectorization, line extraction, and text recognition to extract valuable information. Among these steps, text is the most abundant and essential component, and it is typically extracted with high precision using Optical Character Recognition (OCR) technology.
OCR technology, a fundamental method for automatically recognizing text information, has progressively evolved since its inception and patenting by German scientist Tauschek in 1929. In recent years, open-source OCR techniques have gained widespread adoption, with Tesseract-OCR [
13] and Paddle-OCR [
14,
15] demonstrating notable effectiveness in text recognition tasks. Traditional OCR techniques rely on digital image processing and statistical machine learning models. For instance, Zhang et al. [
16] enhanced printed text recognition by integrating connected component analysis with an improved SVM algorithm. Similarly, Narang et al. [
17] employed a combination of cosine similarity analysis and the Adaboost model to recognize specific handwritten scripts, achieving an accuracy of 91.7%. However, such methods depend on precise text segmentation and are highly susceptible to noise, font variations, and other distortions. On the other hand, deep learning-based approaches encompass both single-text segmentation and sequential line recognition, with the latter being the most widely adopted. For example, Shi et al. [
18] implemented an end-to-end sequence recognition approach using a CRNN model, effectively modeling text as a sequence without needing character-level segmentation while incorporating context information. This method has demonstrated impressive accuracy, reaching 75.1%, 81.7%, and 84.3% across different datasets. These advancements have significantly enhanced the performance of OCR technology across various text recognition tasks.
Researchers have enhanced OCR accuracy by optimizing image preprocessing, refining models, and improving post-processing techniques. For image preprocessing studies, Shin et al. [
19] applied morphological filtering to images, achieving an accuracy of 92.82%. Similarly, Michalak and Okarma [
20] introduced a local image entropy filtering method to address uneven lighting, enhancing OCR accuracy by refining image features. Regarding model optimization, Sporici et al. [
21] improved OCR performance by applying convolutional processing before feeding images into a general recognizer, significantly boosting the F1 score to 0.729. Additionally, some researchers have enhanced recognition accuracy by integrating multiple models, optimizing OCR results from a model aggregation perspective [
22,
23]. Recent advances in this direction include the LMV-RPA framework, which integrates outputs from multiple OCR engines and large language models via a voting strategy to achieve 99% accuracy [
24], and the Consensus Entropy approach proposed by Zhang et al., which selects the most reliable output from multiple vision–language model-based OCR systems based on their agreement characteristics [
25]. In the context of post-OCR correction, Ramirez-Orta et al. demonstrated that combining multiple character-level sequence-to-sequence models through a voting scheme can substantially improve OCR text correction performance [
26]. For post-processing approaches, Kumar and Ramakrishnan [
27] proposed a dictionary-based method for error correction, demonstrating its effectiveness in improving OCR accuracy. However, the effectiveness of post-processing depends heavily on the quality of preprocessing and model optimization: if preprocessing is inadequate, post-processing can rarely compensate for the resulting errors. Taken together, these studies show that image preprocessing, model optimization, and post-processing can each enhance OCR accuracy to a certain extent, but significant room for improvement remains, particularly in refining preprocessing methods.
Although OCR techniques have achieved relatively accurate recognition, their application in specific domains, such as geological borehole data, remains limited. Most existing methods focus on general text recognition, making them less effective for complex and highly structured image data like borehole image data. For instance, Zhang et al. [
12] automated the processing of borehole logs using the Hough transform and corner tagging method combined with the Tesseract-OCR engine. While the method achieved high accuracy in simple text recognition, its performance declined for complex text, resulting in an overall recognition rate of 90%. The process involved image preprocessing, layout analysis, form structure extraction, and text recognition, enabling initial automation of information extraction. However, its precision and recall still require significant improvement when handling complex forms and information-dense content.
The accuracy of text recognition is significantly impeded by the presence of complex backgrounds [
28], encompassing factors such as image distortion, blurriness, and luminance fluctuations. To address these challenges, prior studies have explored advanced methodologies to optimize OCR performance under intricate background conditions [
29,
30]. However, most existing studies focus on how background interference affects text recognition, while the impact of canvas padding on recognition accuracy after text segmentation has received little attention. In practical applications, the canvas padding may influence the spatial configuration of textual elements and compromise the efficiency of feature extraction, thereby exerting a subtle yet consequential impact on recognition precision.
This paper investigates how varying the canvas padding affects text recognition accuracy while keeping the text size constant and without altering the OCR algorithms or applying post-processing, offering a new perspective on optimizing OCR performance. Specifically, it improves the accuracy of borehole log text extraction through optimized image preprocessing and a multi-layout adjustment voting strategy: in the text recognition phase, the canvas padding around each text line is adjusted to produce multiple layouts, and the recognition results of these layouts are combined by voting to further enhance accuracy.
The main contributions of this paper are as follows:
(1) We propose a novel multi-layout adjustment voting strategy that enhances text recognition without modifying the OCR engine or adding post-processing.
(2) We introduce a layout variation method based on canvas padding to reduce recognition uncertainty under a single layout.
(3) We verify the effectiveness of the method through comparative experiments on two borehole log datasets, providing new ideas for improving text recognition accuracy in other fields.
2. Materials and Methods
2.1. Overall Approach
Text recognition in borehole log images faces many challenges, such as complex backgrounds, noise interference, and irregular table structures, all of which lower recognition accuracy. This is especially true for complex tables with dense information. To address this problem, this paper proposes a multi-layout adjustment voting strategy that enhances image preprocessing, adjusts the canvas padding around text in borehole log images, and applies comprehensive voting over multiple canvas padding sizes. The method consists of two main components: (1) based on the height of the Chinese text line in the borehole log image, the blank area around the text is dynamically adjusted to build a variety of text layouts with different blank area sizes; (2) OCR results from the multiple layouts are then combined through a voting mechanism to improve text recognition accuracy.
Figure 1 shows the multi-layout recognition and voting process for borehole log image text. The process starts with data preprocessing and region cropping to extract regions of interest (ROIs) from the raw borehole images. The red boxes in the figure indicate the segmented text-line regions, which serve as the inputs for OCR. Multiple layouts are then generated for each text line by dynamically adjusting the blank area around the text, producing text images with different canvas padding sizes. Subsequently, each of these images is recognized by the OCR engine. Finally, a comprehensive voting mechanism determines the output according to how frequently each recognition result occurs across the layouts. As illustrated in the figure, the character “石 (Shi)” appears four times, more often than any other candidate, and is thus chosen as the final result.
2.2. Data Preprocessing
The extraction of lines and text in the borehole log images depends heavily on image quality. This is particularly true for scanned paper images, where issues such as printing defects, wear, uneven lighting, and noise can hinder information recognition. Therefore, image enhancement is necessary to improve clarity and optimize the extraction process. The enhancement process primarily involves grayscale conversion and binarization, skew correction, and morphological erosion and dilation.
(a) Grayscale conversion and binarization
Converting the image from RGB to grayscale reduces the amount of computation and improves processing efficiency. A local adaptive thresholding method based on a Gaussian-weighted neighborhood mean was then applied to make binarization robust to uneven illumination (
Figure 2).
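As a minimal sketch of this step, the following pure-Python code converts RGB pixels to grayscale and binarizes them against a Gaussian-weighted local mean. It mirrors the rule documented for OpenCV's Gaussian adaptive threshold (a pixel becomes white when it exceeds the weighted neighborhood mean minus a constant), but the source does not state which implementation or parameters were used; `block_size`, `c`, the replicated borders, and all function names here are assumptions chosen for illustration.

```python
import math

def to_gray(rgb):
    """Convert rows of (R, G, B) tuples to luminance with the ITU-R BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in rgb]

def gaussian_kernel(size):
    """Normalized 1-D Gaussian of odd length `size`; sigma follows the default
    rule OpenCV applies when only a kernel size is given."""
    sigma = 0.3 * ((size - 1) * 0.5 - 1) + 0.8
    half = size // 2
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating edge pixels (assumed border rule)."""
    half = len(kernel) // 2
    width = len(img[0])
    return [
        [sum(k * row[min(max(x + i - half, 0), width - 1)] for i, k in enumerate(kernel))
         for x in range(width)]
        for row in img
    ]

def transpose(img):
    return [list(col) for col in zip(*img)]

def adaptive_binarize(gray, block_size=5, c=5.0):
    """White (255) where a pixel exceeds its Gaussian-weighted local mean minus c."""
    kernel = gaussian_kernel(block_size)
    # A 2-D Gaussian is separable: blur along rows, then along columns.
    local_mean = transpose(blur_rows(transpose(blur_rows(gray, kernel)), kernel))
    return [[255 if g > m - c else 0 for g, m in zip(g_row, m_row)]
            for g_row, m_row in zip(gray, local_mean)]

# Hypothetical test image: background brightening from 100 to 180 (uneven
# illumination) with one dark vertical stroke in column 4.
background = [100 + 10 * x for x in range(9)]
row = [v - 60 if x == 4 else v for x, v in enumerate(background)]
gray = [row[:] for _ in range(9)]
binary = adaptive_binarize(gray)
```

On this toy image only the stroke column turns black, whereas a single global threshold of 128 would also blacken the three darkest background columns (values 100, 110, and 120).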
(b) Skew Correction
The borehole log images exhibit a distinct tabular frame structure. Thus, a skew correction algorithm was employed based on Canny edge detection [
31] and the Hough transform. First, Canny edge detection was used to denoise, smooth the image, and extract edge information. Then, the Hough transform detected straight lines and calculated their tilt angles (ranging from −45° to 45°). Based on the detected skew angle, the image was rotated for correction (
Figure 3), ensuring the accurate orientation of the table frame and text. This method effectively corrects skew resulting from improper document placement during scanning, thereby improving the accuracy of subsequent information extraction.
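The angle computation can be sketched as follows, assuming the Hough transform has already produced line segments as `(x1, y1, x2, y2)` endpoints. The source only states that tilt angles lie between −45° and 45°; folding every direction into that range and taking the median as the skew estimate are assumptions made for this illustration, as are the function names. The sign of the resulting angle depends on the image coordinate convention.

```python
import math
from statistics import median

def fold_angle(deg):
    """Fold a line direction into [-45, 45) degrees, so that near-horizontal
    and near-vertical table lines report the same skew."""
    return (deg + 45.0) % 90.0 - 45.0

def estimate_skew(segments):
    """Median folded angle (degrees) of (x1, y1, x2, y2) line segments."""
    angles = [
        fold_angle(math.degrees(math.atan2(y2 - y1, x2 - x1)))
        for x1, y1, x2, y2 in segments
    ]
    return median(angles) if angles else 0.0

def segment(angle_deg, length=100.0):
    """Hypothetical helper: a segment from the origin at the given direction."""
    a = math.radians(angle_deg)
    return (0.0, 0.0, length * math.cos(a), length * math.sin(a))

# Table frame skewed by 2 degrees: horizontal rules (2 and -178 degrees),
# vertical rules (92 degrees), plus one spurious line at 30 degrees.
lines = [segment(2), segment(-178), segment(92), segment(92), segment(30)]
skew = estimate_skew(lines)
```

The median keeps the single spurious line from shifting the estimate; the image would then be rotated by the opposite of `skew` to level the frame.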
(c) Morphological Erosion and Dilation
Morphological operations, through erosion and dilation, remove noise and repair gaps and missing parts in the image. The erosion operation reduces the size of the target objects in the image, while dilation increases their size. By combining these two operations, missing sections caused by issues such as unclear printing or uneven lighting can be effectively restored, thereby enhancing the target features. The “Removing noise” step (
Figure 4) uses a morphological operation sequence with a 2 × 2 rectangular structuring element, which effectively reduces scanning noise and repairs incomplete lines or characters caused by printing defects or uneven illumination.
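To illustrate how erosion and dilation combine, here is a small pure-Python sketch with a 2 × 2 rectangular structuring element on a binary image in which 1 marks ink. The source does not specify the exact operation sequence, so both an opening (erode, then dilate: removes specks) and a closing (dilate, then erode: bridges small gaps) are shown; the anchor position and the treatment of pixels outside the image as background are assumptions.

```python
# 2 x 2 rectangle as (dy, dx) offsets, anchored at its top-left pixel (assumed).
SE = [(0, 0), (0, 1), (1, 0), (1, 1)]
SE_REFLECTED = [(-dy, -dx) for dy, dx in SE]

def _apply(img, offsets, reduce_fn):
    """Reduce each pixel's neighborhood with min or max; outside pixels count as 0."""
    h, w = len(img), len(img[0])

    def at(y, x):
        return img[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [[reduce_fn(at(y + dy, x + dx) for dy, dx in offsets) for x in range(w)]
            for y in range(h)]

def erode(img):
    """Keep a pixel only if the whole structuring element fits inside the ink."""
    return _apply(img, SE, min)

def dilate(img):
    """Grow the ink by the structuring element (uses the reflected offsets)."""
    return _apply(img, SE_REFLECTED, max)

def opening(img):
    """Erode then dilate: removes specks smaller than the element."""
    return dilate(erode(img))

def closing(img):
    """Dilate then erode: fills gaps narrower than the element."""
    return erode(dilate(img))

# A 3 x 3 character blob plus two isolated noise pixels.
clean = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(6)] for y in range(6)]
noisy = [r[:] for r in clean]
noisy[0][5] = 1
noisy[5][0] = 1
denoised = opening(noisy)

# A horizontal stroke with a one-pixel break in the middle.
broken = [[0] * 7 for _ in range(5)]
broken[2] = [0, 1, 1, 0, 1, 1, 0]
repaired = closing(broken)
```

In this toy case the opening removes both specks while leaving the blob unchanged, and the closing fills the break without lengthening the stroke.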
Based on data preprocessing, the region of interest (ROI) in borehole images was extracted using a combination of layout analysis [
32], frame line extraction [
33], and the corner tagging method. The method effectively separates table content from text content, thus ensuring accurate extraction of subsequent textual information.
2.3. OCR Optimization Based on Multi-Layout Voting
The development of OCR engines for general domains has been rapid, achieving high efficiency and accuracy in tasks involving plain text images. For example, Paddle-OCR claims to reach an accuracy of around 95%. However, in practical recognition tasks, particularly in specialized fields such as geological borehole log images (the target domain of this study), the recognition accuracy falls significantly short of the advertised rates.
This study proposes an OCR optimization method based on input image layout optimization. After image preprocessing, the input text image was processed using horizontal and vertical projections to focus the text. Subsequently, the layout of the text in the image was optimized by adjusting the canvas padding. The processed image was then fed into the OCR system (Paddle-OCR and Tesseract-OCR were used in this study). Finally, multiple recognition results were obtained by inputting text images with different layouts, and a voting mechanism was used to optimize the final results. The specific process is as follows:
STEP 1: Text Line Focusing
First, potential segmentation locations were identified by projecting the image vertically and horizontally, and neighboring regions were merged to extract character regions (as shown in
Figure 5a). In addition, recognition accuracy may suffer for text lines with large character spacing, so the spacing of such lines must be adjusted appropriately to improve recognition (as shown in
Figure 5b).
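The projection step can be sketched in a few lines of Python. The binary image is represented as a list of rows in which 1 marks ink; the function names and the `max_gap` merging parameter are hypothetical, since the source does not give the rule used to merge neighboring regions.

```python
def projections(binary):
    """Horizontal and vertical projection profiles: ink count per row and per column."""
    rows = [sum(r) for r in binary]
    cols = [sum(c) for c in zip(*binary)]
    return rows, cols

def runs(profile):
    """Return [start, end) index ranges where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def merge_close(spans, max_gap):
    """Merge neighboring spans separated by at most max_gap blank pixels."""
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# Toy text line: rows 1-3 carry ink; a two-stroke character occupies
# columns 1-2 and 4-5, and a second character occupies columns 9-10.
ink_cols = {1, 2, 4, 5, 9, 10}
binary = [[1 if 1 <= y <= 3 and x in ink_cols else 0 for x in range(12)]
          for y in range(5)]

row_profile, col_profile = projections(binary)
line_spans = runs(row_profile)                       # vertical extent of the text line
stroke_spans = runs(col_profile)                     # raw candidate cut positions
char_spans = merge_close(stroke_spans, max_gap=1)    # strokes merged into characters
```

Here the two strokes separated by a single blank column are merged into one character region, while the wider gap before the second character is kept as a segmentation point.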
STEP 2: Text Layout Adjustment
After obtaining the text image segmented into lines in STEP 1, further layout adjustments were performed on the resulting text line images. The layout adjustment process includes both line text focusing and canvas padding. Line text focusing was mainly used to correct text lines whose character spacing was too wide. The method uses a threshold of H/4, where H is the height of the text line image, to segment the line into text blocks with normal spacing, and then reassembles the blocks to restore reasonable spacing. This process generates a more standardized text line image (
Figure 6a). The purpose of canvas padding adjustment is to optimize the layout of the text in the image by changing the blank area around it. Consider a text line image of size H × W, where H denotes the height and W the original width of the text line image. Based on OpenCV, the canvas padding of the text line image was adjusted using H as the unit to optimize the overall layout. Specifically, 2H(W + H) denotes resizing the canvas height to 2H and its width to W + H, with the expanded region filled with a white background (as shown in
Figure 6b). Other canvas padding adjustments follow this pattern. For each distinct text line image, a set of differently laid-out images is generated and subsequently fed into the OCR engine.
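A minimal sketch of the padding step is given below. It assumes that the 2H(W + H) pattern generalizes to a canvas of height kH and width W + (k − 1)H for k = 1, 2, 3, 4, and that the text line is centered on the white canvas; neither detail is spelled out in the text, and the function names are hypothetical. Images are plain lists of rows of gray values.

```python
WHITE = 255

def pad_canvas(line_img, scale):
    """Place an H x W text line on a white canvas of height scale*H and width
    W + (scale - 1)*H, centered (the centering is an assumption)."""
    h, w = len(line_img), len(line_img[0])
    extra = (scale - 1) * h          # total padding added to each dimension
    top = left = extra // 2
    canvas = [[WHITE] * (w + extra) for _ in range(h + extra)]
    for y, row in enumerate(line_img):
        canvas[top + y][left:left + w] = row
    return canvas

def layout_variants(line_img, scales=(1, 2, 3, 4)):
    """One padded image per layout, keyed 'H', '2H', '3H', '4H'."""
    return {("H" if s == 1 else f"{s}H"): pad_canvas(line_img, s) for s in scales}

# Hypothetical 4 x 10 text line that is entirely ink (gray value 0).
line = [[0] * 10 for _ in range(4)]
variants = layout_variants(line)
sizes = {name: (len(img), len(img[0])) for name, img in variants.items()}
```

With NumPy image arrays, the same border could be added through OpenCV's `cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=255)`; each variant would then be passed to the OCR engine in turn.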
STEP 3: Multi-layout recognition voting
After adjusting the layout of the text line image, each layout was fed into the OCR engine to obtain recognition results. Since different layouts may yield different recognition results, a voting mechanism was introduced to determine the final output. The voting mechanism counts the frequency of each recognition result across the layouts and selects the one with the highest frequency. Empty recognition results are excluded from the count. For example, if Layout 4 recognizes a text line as “black mica schist” while all other layouts produce no result, “black mica schist” receives the only valid vote (a vote share of 1/1) and is output. In the case of a tie, the first of the tied results in the original recognition order is selected as the final output. If all recognition results are empty, the output is an empty string.
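The voting rule described in this step can be sketched as follows; the function name and the example strings (other than “black mica schist” and “石”) are hypothetical.

```python
def vote(results):
    """Majority vote over per-layout OCR strings.

    Empty results are ignored, ties go to the candidate that appeared first
    in the recognition order, and an empty string is returned when every
    layout produced nothing.
    """
    counts = {}
    for text in results:
        if text:                                  # skip empty recognitions
            counts[text] = counts.get(text, 0) + 1
    if not counts:
        return ""
    # dicts preserve insertion order and max() returns the first maximal key,
    # so ties resolve to the earliest-seen candidate.
    return max(counts, key=counts.get)

majority = vote(["石", "右", "石", "石", "石"])         # "石" wins with four votes
single = vote(["", "", "", "black mica schist"])      # the only non-empty result
tie = vote(["granite", "gneiss", "gneiss", "granite"])  # tie: first seen wins
nothing = vote(["", "", ""])                          # all layouts empty
```

Because empty strings never enter the tally, a single confident layout is enough to produce an output, which matches the 1/1 case described above.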
4. Experimental Analysis
From
Table 2, Paddle-OCR demonstrates significantly higher overall accuracy than Tesseract-OCR. This difference arises primarily because the test data consist of Chinese geological borehole log images, which contain numerous complex geological terms, and Tesseract-OCR struggles to recognize such domain-specific Chinese content. In addition to the borehole-specific datasets, we also evaluated the multi-layout voting strategy on general-purpose data, including street scene images and synthetic Chinese text. As shown in
Table 3, the proposed method improves the F1-score by 1.6–2.8 percentage points for Paddle-OCR and by 2–4 percentage points for Tesseract-OCR. Notably, the baseline recognition performance (without multi-layout voting) of both engines on these datasets aligns with the results reported in recent OCR benchmarking studies, such as the 2024 open-source evaluation of 12 OCR engines [
34]. This consistency confirms the reliability of our experimental results. Moreover, our multi-layout voting strategy demonstrates additional performance gains on top of these already-representative baselines, thereby validating its effectiveness in improving OCR accuracy even on general datasets.
As shown in
Table 2, recognition accuracy changes as the canvas height is gradually increased from the original H to 2H, 3H, and 4H. The overall trend shows that increasing the canvas padding gradually improves recognition accuracy, indicating that layout adjustment contributes to better text recognition in borehole log images. However, beyond a certain padding size the gains level off, with only minor variations. After applying the multi-layout voting strategy, recognition accuracy, recall, and F1 score improve further over the results of layout adjustment alone. For instance, on the Dayingezhuang gold deposit dataset, “Multi-layout voting (Paddle)” increases the F1 score from 95.93% to 97.96%, while “Multi-layout voting (Tesseract)” improves the F1 score from 59.88% to 61.82%. These results confirm the effectiveness of this optimization strategy in enhancing text recognition accuracy for borehole log images.
Table 2 illustrates that “Multi-layout voting (Paddle)” and “Multi-layout voting (Tesseract)” achieved their highest recognition performance on the Dayingezhuang dataset, with F1 scores of 97.96% and 61.82%, respectively. The lower scores on the Dingjiashan dataset are attributed to its poorer image quality, including noise. In contrast, the Dayingezhuang dataset exhibits a more uniform text distribution, with standardized line and word spacing, which contributes to improved recognition performance. These findings indicate that the quality of the original data significantly impacts text recognition accuracy.
As shown in
Table 2, when recognition accuracy is already high after preprocessing, optimizing the canvas padding yields only marginal improvements. For instance, on the Dayingezhuang dataset, the F1 score increases by only about 2%. In contrast, when the accuracy after preprocessing is lower, comprehensive adjustment of the canvas padding results in more significant improvements in recognition performance. For example, on the Dingjiashan dataset, the F1 score increases by approximately 10%. This suggests that the multi-layout adjustment voting strategy has a more pronounced effect in cases with lower initial recognition accuracy.
When comparing the results in
Table 2 and
Table 3, it is evident that Tesseract-OCR consistently underperforms Paddle-OCR across all datasets. This performance gap is particularly pronounced on the street scene image dataset in
Table 3, where the F1-score of Tesseract-OCR remains below 16%, compared with over 70% for Paddle-OCR. This substantial difference can be attributed to several factors. First, Tesseract-OCR was primarily designed for printed documents and lacks robust pretraining for complex real-world scenes with distorted or cluttered text layouts. Second, its support for Chinese character sets, especially in scene text where font, orientation, and background vary greatly, is limited compared with deep learning-based engines such as Paddle-OCR. As a result, Tesseract-OCR struggles to generalize to unconstrained environments such as natural scene text, which often involves noisy backgrounds, irregular lighting, and artistic or non-standard fonts. This explains the large performance gap observed in
Table 3 and further underscores the advantage of using more modern, learning-based OCR engines in real-world applications.
In summary, the multi-layout adjustment voting optimization strategy significantly improves the precision, recall, and F1-score of OCR engines on borehole log images, with recall showing the most notable enhancement. This approach not only ensures high recognition accuracy but also significantly enhances the completeness of the recognition results. Despite overall improvements, the proposed method still has some limitations. Recognition errors may occur in areas with poor image quality, overlapping symbols, or irregular text backgrounds. For instance, in the Dingjiashan dataset, geological symbols placed close to Chinese characters sometimes lead to misrecognition, particularly with Tesseract-OCR. In addition, slanted or occluded text lines can cause segmentation failures and reduce recognition accuracy. These cases suggest that the method’s effectiveness may be limited under severe noise or non-standard layouts. Future work will consider integrating post-processing techniques, such as language models or domain-specific correction tools, to enhance robustness.