Article

Cascade Network with Deformable Composite Backbone for Formula Detection in Scanned Document Images

by Khurram Azeem Hashmi 1,2,3,*, Alain Pagani 3, Marcus Liwicki 4, Didier Stricker 1,3 and Muhammad Zeshan Afzal 1,2,3,*
1 Department of Computer Science, Technical University, 67663 Kaiserslautern, Germany
2 Mindgarage, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
3 German Research Institute for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
4 Department of Computer Science, Luleå University of Technology, 971 87 Luleå, Sweden
* Authors to whom correspondence should be addressed.
Appl. Sci. 2021, 11(16), 7610; https://doi.org/10.3390/app11167610
Submission received: 5 July 2021 / Revised: 6 August 2021 / Accepted: 17 August 2021 / Published: 19 August 2021
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This paper presents a novel architecture for detecting mathematical formulas in document images, which is an important step for reliable information extraction in several domains. Recently, Cascade Mask R-CNN networks have been introduced to solve object detection in computer vision. In this paper, we suggest several modifications to the existing Cascade Mask R-CNN architecture: First, the proposed network uses deformable convolutions instead of conventional convolutions in the backbone network to better attend to regions of interest. Second, it uses a dual ResNeXt-101 backbone with composite connections at the parallel stages. Finally, our proposed network is end-to-end trainable. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets. The proposed approach demonstrates state-of-the-art performance on ICDAR-2017 POD at a higher IoU threshold with an f1-score of 0.917, reducing the relative error by 7.8%. Moreover, we accomplish a correct detection accuracy of 81.3% on embedded formulas on the Marmot dataset, which results in a relative error reduction of 30%.

1. Introduction

Information extraction from document images is a primary need in various domains such as banking, archiving, or academia and industry in general. Research in document analysis has been trying to develop precise information extraction systems for several years [1,2,3,4]. Although state-of-the-art optical character recognition (OCR) systems [5,6] recognize regular text with high accuracy, they struggle to recognize information in page objects (tables, figures, mathematical formulas) in document images [7,8]. Figure 1 illustrates the problem in which an open-source OCR, Tesseract [4] (we use the LSTM-based version 4.1.1 available at https://github.com/tesseract-ocr/tesseract accessed on 5 July 2021), is applied to extract the content from a document image. While the OCR recognizes the regular textual content, it fails to extract the information from mathematical formulas. This shows that formula detection is a crucial preliminary step for information extraction in such document images.
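For reference, such an OCR pass can be reproduced with a few lines using the pytesseract wrapper (a minimal sketch, assuming a local image file named document.png):

```python
from PIL import Image
import pytesseract  # Python wrapper around the Tesseract OCR engine

# Run the LSTM-based Tesseract engine (OEM 1) with automatic page segmentation.
# Formula regions typically come back garbled or missing, which is why they
# need to be detected and handled separately.
image = Image.open("document.png")
text = pytesseract.image_to_string(image, config="--oem 1 --psm 3")
print(text)
```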
Mathematical formulas are an integral part of documents because they allow us to represent complex information concisely by exploiting the capabilities of mathematical notation. Formulas present in documents are categorized into isolated formulas (set on a separate line) and embedded formulas (inline mathematical symbols). Figure 2 exhibits the problem of detecting isolated and embedded formulas in document images.
The task of detecting both isolated and embedded formulas in document images is a difficult problem because of the underlying low inter-class and high intra-class variance [10]. The hurdles involved in detecting isolated and embedded formulas are exhibited in Figure 2. The isolated formulas present in a document image can easily be misclassified as other page objects due to low inter-class variance with tables, algorithms, and figures. The embedded formulas contain mathematical functions (log, exp, tan), operators (×, +, σ, %), and variables (i, j, k). These inline expressions are prone to be confused with the regular text in a document image [11].
Previous works employed hand-crafted features to detect formulas in documents [2,12,13]. Although these systems extract mathematical formulas, they fail to obtain effective results on generic datasets. Later, statistical approaches, mainly machine learning-based methods, advanced the performance of formula identification systems [14,15,16]. The recent success of deep learning-based methods in computer vision within the last decade has also had an impact on the task of formula detection in scanned document images. Several deep learning-based formula detection approaches [17,18,19,20] have been presented in the past two years. They are mainly equipped with object detection algorithms such as Faster R-CNN [21], YOLO [22], SSD [23], and FPNs [24].
In recent work, Agarwal et al. [25] presented a method equipped with Cascade Mask R-CNN [26] to tackle the problem of table detection in document images. However, the capabilities of Cascade Mask R-CNN have not been investigated yet in the domain of mathematical formula detection in document images.
This paper presents an end-to-end data-driven approach to detect both isolated and embedded formulas in document images. The main contributions of this paper are as follows:
  • We present an end-to-end trainable framework that operates on a Cascade Mask R-CNN equipped with a deformable composite backbone to detect both isolated and embedded formulas in document images.
  • Unlike prior work, our formula detection pipeline operates on a lightweight dilation method as a pre-processing step.
  • We accomplish state-of-the-art results in detecting isolated formulas at a higher IoU threshold on the ICDAR-2017 POD dataset [27]. Furthermore, on the Marmot dataset [9], we surpass the previous state-of-the-art results on embedded formulas by a large margin and match the prior state-of-the-art on isolated formulas.

2. Related Work

Research progress in the field of document image analysis directly relates to advances in the computer vision research community. The task of formula detection in documents is a well-studied problem [28], and noticeable progress has been achieved in this domain, ranging from custom heuristics to deep learning-based approaches. Early rule-based approaches developed character-based heuristics to identify formulas in documents [29,30,31,32]. These techniques look for special characters (e.g., “>”, “×”, “=”) that mainly occur in mathematical formulas.
Kacem et al. [12] introduced a model based on fuzzy logic to detect mathematical symbols. The approach predicts the formula region by exploiting the features of mathematical symbols. Inoue et al. [2] first employed a conventional OCR method to extract characters and treated all remaining characters that the OCR was unable to parse as mathematical symbols.
Specific OCR systems have been presented that recognize mathematical symbols based on their positions and sizes [2]. Baker et al. [13] separated the lines containing formulas from the regular textual lines in order to detect isolated formulas in PDF documents.
Decision trees have also been employed to detect isolated formulas by distinguishing formula lines from plain text lines [33]. Chang et al. [15] proposed a similar method based on projection profiles that only works for isolated formulas in documents.
Later, machine learning-based algorithms were proposed to improve the performance of formula detection systems in documents [14,34]. Liu et al. [16] leveraged the combination of a Conditional Random Field (CRF) and a Support Vector Machine (SVM) to classify sparse lines in documents. Subsequently, the method distinguished formulas from other graphical page objects such as figures and tables by applying custom heuristics.
Subsequently, researchers have investigated the capabilities of Deep Neural Networks (DNNs) for the problem of formula identification in document images [27,35]. To the best of our knowledge, He et al. [36] were the first to exploit Convolutional Neural Networks (CNNs) with spatial context to detect mathematical symbols in document images. Later, Gao et al. [37] presented a deep learning-based formula detection system for PDF documents.
NLPR-PAL [27] produced the best results in the competition of POD at ICDAR-2017. They proposed a blend of connected components, SVM, and Faster R-CNN [21] to detect figures, formulas, and tables in document images.
Yi et al. [38] published another CNN-based approach that detects graphical page objects such as tables, figures, and formulas in document images. The authors employed a dynamic programming technique instead of Non-Maximum Suppression (NMS) to refine the final candidate proposals. Semantic segmentation-based architectures such as U-Net [39] have also been utilized to detect mathematical expressions in scientific document images [17].
Recently, Phong et al. [18] published a method equipped with YOLO [40] to detect mathematical formulas in document images. In another approach [19], SSD [23] was exploited to detect mathematical expressions in PDF documents.
Another graphical page object detection system was published by Li et al. [41]. The authors combined deep structured prediction with a traditional approach to detect page objects, including formulas, in document images. Younas et al. [20] introduced a system called Fi-Fo that detects figures and formulas in document images. The authors empirically established that deformable convolutions [42] with Feature Pyramid Networks (FPN) [24] are a better fit than other object detection algorithms. The proposed approach relied heavily on image transformation pre-processing techniques to produce state-of-the-art results.

3. Method

The presented approach comprises a Cascade Mask R-CNN [43] equipped with a recently published composite backbone in which the conventional convolution filters are replaced with deformable convolutions. Figure 3 illustrates the complete pipeline of our proposed framework. In this section, we dive deeper into each component of our proposed method.

3.1. Cascade Mask R-CNN

We treat the problem of formula detection in document images as an object detection problem on natural images. Recently, Cai and Vasconcelos [26] introduced Cascade R-CNN, which extends the idea of Faster R-CNN [21] by adding a multi-stage detection scheme. In our approach, we incorporate the instance segmentation branch as proposed in the original Mask R-CNN [43].
As shown in Figure 3, the input image is passed through the composite ResNeXt-101 backbone, which is explained in Section 3.2. The backbone extracts the spatial features and generates feature maps. The Region Proposal Network (RPN) head estimates the candidate regions in which formulas may be present. The first bounding box head receives the features from the RPN and creates predictions. Each of the three bounding box heads performs classification and regression. The classification scores and bounding box coordinates predicted by the bounding box heads $BH_1$, $BH_2$, and $BH_3$ are denoted by $(C_1, B_1)$, $(C_2, B_2)$, and $(C_3, B_3)$, respectively. The output of one bounding box head becomes the training input for the next head. This cascaded regression and classification scheme improves the ability to differentiate false positive samples from true positives even at higher IoU thresholds. After the refined bounding boxes and classification scores are computed by $BH_3$, the segmentation head predicts the mask, which contributes to the loss function to further optimize the training.
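The following simplified sketch (illustrative only, not the authors' implementation) shows the cascaded data flow: each box head is associated with a stricter IoU threshold and refines the boxes passed on by the previous head.

```python
import torch
import torch.nn as nn


class CascadeBoxHeads(nn.Module):
    """Simplified sketch of the three cascaded box heads BH1-BH3: each head is
    trained with a stricter IoU threshold and refines the boxes of the previous one.
    (A full implementation re-pools RoI features from the refined boxes and decodes
    the regression deltas properly; this only illustrates the data flow.)"""

    def __init__(self, feat_dim=256, num_classes=2, iou_thresholds=(0.5, 0.6, 0.7)):
        super().__init__()
        self.num_classes = num_classes
        self.iou_thresholds = iou_thresholds
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                          nn.Linear(feat_dim, num_classes + 4))
            for _ in iou_thresholds)

    def forward(self, roi_feats, boxes):
        outputs = []
        for head, thr in zip(self.heads, self.iou_thresholds):
            pred = head(roi_feats)
            scores = pred[:, :self.num_classes]     # C_i: classification scores
            deltas = pred[:, self.num_classes:]     # box regression deltas
            boxes = boxes + deltas                  # B_i: refined boxes feed the next head
            outputs.append((scores, boxes, thr))    # thr controls positive sampling at train time
        return outputs


# Usage with dummy RoI features and proposals coming from the RPN
stage_outputs = CascadeBoxHeads()(torch.randn(100, 256), torch.randn(100, 4))
```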

3.2. Composite Backbone

We employ a robust and novel dual backbone architecture to extract the spatial features needed to detect formulas in document images. The performance of any object detection algorithm depends on the quality of the feature map it receives from the feature extraction network [44]. In this paper, we implement a dual backbone-based network [45] in which the first backbone is the assistant backbone and the other is known as the lead backbone. The two backbones are compositely connected so that the assistant backbone's output features are treated as input features for the lead backbone. Figure 4 illustrates the architecture of our dual composite backbone.
For a conventional convolutional network with a single backbone, the output of the $(l-1)$-th stage is propagated as input to the $l$-th stage, which is given by
$x^{l} = F^{l}(x^{l-1}), \quad l \geq 2 \qquad (1)$
where $F^{l}$ represents the non-linear function of the $l$-th stage. In contrast, our lead backbone receives input both from its own prior stage and from the parallel stage of the assistant backbone. Therefore, the input of the lead backbone $b_l$ at stage $l$ combines the output of the lead backbone at stage $(l-1)$ with the output of the parallel $l$-th stage of the assistant backbone $b_a$. Mathematically, it is explained in [45] as
$x_{b_l}^{l} = F_{b_l}^{l}\big(x_{b_l}^{l-1} + g(x_{b_a}^{l})\big), \quad l \geq 2 \qquad (2)$
where g defines the composite connection between the lead and assistant backbone, and these composite connections enable the lead backbone to extract essential spatial features. Table 1 outlines the architectural details of the employed dual ResNeXt-101 backbone network. As explained in Figure 3, we propagate the output of the final lead backbone to the region proposal network of our Cascade Mask R-CNN.
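The following is a minimal, self-contained sketch of Equation (2) with placeholder stages; the channel widths and the exact form of $g$ are assumptions (in CBNet, $g$ is a 1 × 1 convolution followed by upsampling to match the lead backbone's input resolution).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_stage(in_ch, out_ch):
    """Placeholder for one backbone stage (a real ResNeXt stage is a stack of blocks)."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU())


class CompositeDualBackbone(nn.Module):
    """Sketch of the dual backbone with composite connections (Eq. (2)): the lead
    stage l receives its own stage-(l-1) output plus g(assistant stage l), where g
    is a 1x1 projection followed by upsampling so that the shapes match."""

    def __init__(self, channels=(64, 256, 512, 1024, 2048)):
        super().__init__()
        pairs = list(zip(channels[:-1], channels[1:]))
        self.assistant = nn.ModuleList(make_stage(i, o) for i, o in pairs)
        self.lead = nn.ModuleList(make_stage(i, o) for i, o in pairs)
        # g: project the assistant stage-l output back to the lead stage-l input width
        self.g = nn.ModuleList(nn.Conv2d(o, i, kernel_size=1) for i, o in pairs)

    def forward(self, x):
        feats, x_a, x_l = [], x, x
        for stage_a, stage_l, g in zip(self.assistant, self.lead, self.g):
            x_a = stage_a(x_a)                                  # assistant stage output
            comp = F.interpolate(g(x_a), size=x_l.shape[-2:])   # composite connection g(.)
            x_l = stage_l(x_l + comp)                           # Eq. (2)
            feats.append(x_l)                                   # lead features go to the FPN/RPN
        return feats


# Usage with a dummy post-stem feature map (64 channels)
lead_features = CompositeDualBackbone()(torch.randn(1, 64, 224, 224))
```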

3.3. Deformable Convolution

We incorporate deformable convolution filters [42] in place of the conventional convolutions in the ResNeXt-101 architecture [46]. Convolutional neural networks extract the spatial features that are essential to perform the required task, and convolutional layers discover different features depending on their position in the hierarchy [47]. Layers at the bottom search for crude features such as sharp edges or gradients, whereas layers at higher levels look for abstract components such as complete objects [48]. The conventional convolution operation has the same effective receptive field for all neurons. A 2D convolution comprises two steps: (1) the input feature map is sampled over a regular grid $R$, and (2) the sampled values are weighted by $w$ and aggregated. For a conventional convolution, the output feature map $y$ at each position $p_0$ is given in [42] as
$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \qquad (3)$
where $x$ represents the input feature map, and $p_n$ iterates over the locations in the grid $R$, which for a $3 \times 3$ convolutional layer is defined as $R = \{(-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1)\}$. The effective receptive field of such a filter is restricted to these nine positions.
In the case of deformable convolution, an additional offset, denoted $\Delta p_n$, is added, which deforms the filter's receptive field by augmenting the predefined offsets. Hence, Equation (3), as explained in [42], is transformed into
$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \qquad (4)$
This modification makes the sampling irregular, with sampling locations $p_n + \Delta p_n$. Since these offsets are fractional and need to remain differentiable, they are implemented with bilinear interpolation. Considering $p = p_0 + p_n + \Delta p_n$, the bilinear interpolation is implemented as follows:
$x(p) = \sum_{q} G(q, p) \cdot x(q) \qquad (5)$
where $q$ iterates over all spatial locations of the input feature map $x$, and $G$ represents the bilinear interpolation kernel. It is vital to mention that $G$ is a two-dimensional kernel that can be separated into two one-dimensional kernels. It is mathematically explained as
$G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y) \qquad (6)$
where $g(a, b) = \max(0, 1 - |a - b|)$. It is important to note that Equation (5) is efficient to compute since $G(q, p)$ is zero for most values of $q$. We refer our readers to [42,49] for a detailed explanation of deformable convolutions. Figure 5 depicts the architecture of a deformable convolution. To convert our composite backbone network into its deformable counterpart, we replace the conventional convolution operation with deformable convolution in the higher-level stages, i.e., from stage $C_3$ to stage $C_5$. Table 1 highlights the presence of deformable convolutions in the backbone network.
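For illustration, a minimal sketch of a deformable 3 × 3 convolution using torchvision's DeformConv2d (not the authors' implementation): a small regular convolution predicts the offsets $\Delta p_n$, which then deform the sampling grid of the main filter.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableConvBlock(nn.Module):
    """Minimal deformable 3x3 convolution: a regular conv predicts per-location
    (dy, dx) offsets that deform the sampling grid of the main filter."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # 2 offset values (dy, dx) for each of the kernel_size * kernel_size taps
        self.offset_conv = nn.Conv2d(in_ch, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)   # start out as a regular convolution
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        offsets = self.offset_conv(x)           # Delta(p_n) in Eq. (4)
        return self.deform_conv(x, offsets)     # bilinear sampling at p_0 + p_n + Delta(p_n)


# Usage: such a block would replace a 3x3 convolution in stages C3-C5 of the backbone.
feat = torch.randn(1, 256, 28, 28)
out = DeformableConvBlock(256, 256)(feat)       # -> torch.Size([1, 256, 28, 28])
```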

3.4. Image Transformation and Pre-Processing

Document images mainly consist of textual regions, with a variable amount of space between textual components. This space not only separates the textual components but also provides a higher level of semantic structure. We can think of formula detection as a semantic labeling task in which a textual unit is labeled as a formula or as other text depending on its contents. In order to group closely related regions, we apply a dilation transformation to the images. The dilation transformation converts the input images into a semantically enriched representation. It is crucial to understand that this grouping cannot replace the actual image content. Therefore, we concatenate the pre-processed images with the original images, which increases the number of input channels. The deep neural network processes this combined input.

Dilation Transformation

The dilation transformation is used to thicken the black regions in the input image. Since this transformation works on binary images, the input images are binarized first. Black pixels represent the characters, and white pixels describe the background in the binarized images. Therefore, this transformation thickens the characters. Figure 6 depicts the output of the dilation transformation on a sample image. We use a 2 × 2 structuring element; among the sizes we tried, it produced the best results.
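A minimal sketch of this step with OpenCV follows. The exact binarization procedure is an assumption: here the image is thresholded with Otsu's method and inverted so that the characters form the foreground before dilation.

```python
import cv2
import numpy as np


def dilate_document(gray_image: np.ndarray) -> np.ndarray:
    """Thicken the characters of a grayscale document image with a 2x2 kernel."""
    # Binarize and invert so that (black) characters become the white foreground
    _, binary = cv2.threshold(gray_image, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = np.ones((2, 2), np.uint8)          # 2x2 structuring element
    thickened = cv2.dilate(binary, kernel, iterations=1)
    return cv2.bitwise_not(thickened)           # back to black text on white

# The dilated image is then stacked with the original as additional input
# channels before being fed to the network (Section 3.4).
```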

3.5. Datasets

We employed the well-known publicly available formula detection datasets to conduct our experiments. This section elaborates upon these datasets, and their summary is presented in Table 2.

3.5.1. ICDAR-17

ICDAR-17 is the result of a recent competition on graphical page object detection (POD) [27] in document images, held at ICDAR in 2017. The dataset contains 2417 document images with annotations for figures, formulas, and tables. In addition, it contains a variety of isolated formulas on single- and multi-column document images. For the experiments, we use 1600 images for training and 817 images for testing. Recently, Younas et al. [20] published a corrected version of this dataset, which contains additional formula annotations. Therefore, we employ the revised version of the dataset in our experiments for a direct comparison with state-of-the-art results.

3.5.2. Marmot

Marmot [50] is a fairly small dataset consisting of 400 scanned document images. However, it contains annotations for both isolated and embedded mathematical formulas. There are 1575 isolated formulas, varying from 4 to 20 formulas per document image, and 7907 embedded formulas, with an average of almost 20 embedded formulas per document image.

4. Experimental Results

4.1. Model Configuration

We implement the proposed method in PyTorch by leveraging the MMDetection object detection toolbox [51]. Our composite ResNeXt-101 backbone [46] is pre-trained on the MS-COCO dataset [52]. The pre-trained feature extraction network helps our object detection algorithm adapt from the domain of natural scenes to documents. We scale the input document images to 1200 × 800 while maintaining the original aspect ratio. The training starts with a learning rate of 0.0025, which is reduced after every eighth epoch. We train the network for a total of 20 epochs on both datasets. The IoU threshold values for the cascaded bounding box heads are set to [0.5, 0.6, 0.7]. We employ three anchor ratios of [0.5, 1.0, 2.0] with a single anchor scale of [8], since the FPN [24] itself performs multi-scale detection owing to its top-down architecture. We train with a batch size of one. The models for both datasets are trained on an NVIDIA GeForce RTX 101 Ti GPU with 12 GB of memory.
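The key settings can be summarized in an MMDetection-style configuration fragment. This is a hedged sketch rather than the exact configuration used: a CBNet-style dual backbone is not part of stock MMDetection, so the backbone type name below (CBResNeXt) is a placeholder, and only the hyperparameters stated above are filled in.

```python
# Sketch of the main settings in MMDetection config style (assumed structure;
# 'CBResNeXt' is a placeholder for the composite dual ResNeXt-101 backbone).
model = dict(
    type='CascadeRCNN',
    backbone=dict(
        type='CBResNeXt',                            # hypothetical composite backbone
        depth=101, groups=32, base_width=4,
        dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True)),   # deformable convolution in C3-C5
    neck=dict(type='FPN', in_channels=[256, 512, 1024, 2048],
              out_channels=256, num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        anchor_generator=dict(type='AnchorGenerator',
                              scales=[8],              # single anchor scale
                              ratios=[0.5, 1.0, 2.0],  # three anchor ratios
                              strides=[4, 8, 16, 32, 64])))
# The three cascade stages use increasing positive-IoU thresholds.
train_cfg = dict(rcnn=[dict(assigner=dict(type='MaxIoUAssigner', pos_iou_thr=t))
                       for t in (0.5, 0.6, 0.7)])
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
data = dict(samples_per_gpu=1)          # batch size of one
img_scale = (1200, 800)                 # images resized, aspect ratio preserved
```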

4.2. Evaluation Metrics

For ICDAR-2017 POD, we work with the same evaluation criteria as elaborated in the ICDAR-2017 POD competition [27]. For the Marmot dataset, we follow the identical criteria of computing detection accuracy as explained in [11] to have direct comparisons. We report results by employing the following metrics.

4.2.1. Precision

Precision [53] is the ratio of true positive samples over all predicted positive samples. Mathematically, it is given by
$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \qquad (7)$

4.2.2. Recall

Recall [53] is the ratio of true positive predictions over all positive samples present in the ground truth. It is explained as follows:
$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \qquad (8)$

4.2.3. F1-Score

The f1-score [53] is computed as the harmonic mean of precision and recall. The formula for the f1-score is
$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (9)$
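These three metrics can be computed together from raw detection counts; a minimal sketch:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and f1-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Example: 1836 of the 1929 isolated formulas detected (Section 4.3.1) corresponds to
# recall = 1836 / 1929 ≈ 0.952; precision additionally needs the false positive count.
```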

4.2.4. Mean Average Precision (mAP)

The mean average precision, also referred to as mAP score, is calculated by averaging maximum precision over various recall thresholds. Mathematically, it is explained in [52] as follows:
$\text{mAP} = \frac{1}{N} \sum_{r=1}^{N} AP_r \qquad (10)$
where $AP_r$ is the average precision at recall level $r$.
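As an illustration of how an AP value can be computed, a common approximation (the Pascal VOC 11-point interpolation, used here as an assumed example rather than the exact protocol of the competition toolkit) averages the maximum precision attained at eleven fixed recall levels:

```python
import numpy as np


def average_precision_11pt(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """11-point interpolated AP: average of the maximum precision attained
    at recall >= r for r in {0.0, 0.1, ..., 1.0}."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        p_max = precisions[mask].max() if mask.any() else 0.0
        ap += p_max / 11
    return ap

# mAP is then the mean of the per-class AP values (here there is a single "formula" class).
```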

4.2.5. Intersection over Union (IOU)

The Intersection over Union metric [54] estimates the extent to which the predicted region overlaps with the ground truth region. It is explained as follows:
$\text{IoU}(A, B) = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|} \qquad (11)$
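A minimal sketch of the IoU computation for two axis-aligned bounding boxes:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```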

4.2.6. Detection Accuracy

We report results on the Marmot dataset using the detection accuracy metric. As explained in [11], we classify each prediction as correct or partial based on its IoU value (a short sketch of this rule follows the list):
  • Correct: the predicted bounding box is considered correct when the IoU score between the predicted formula region and the ground truth is greater than or equal to 0.5.
  • Partial: when the IoU score between the inferred and the ground truth formula region lies in the interval (0, 0.5), the detection is categorized as partial.
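A short sketch of this classification rule, reusing the iou helper above (the "missed" label is our own shorthand for detections with no overlap):

```python
def classify_detection(pred_box, gt_box, threshold=0.5):
    """Label a predicted formula region as correct, partial, or missed by its IoU."""
    score = iou(pred_box, gt_box)
    if score >= threshold:
        return "correct"
    elif score > 0.0:
        return "partial"
    return "missed"
```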

4.3. Results and Discussion

We report the results on the datasets of ICDAR-2017 POD [27] and Marmot [9] to demonstrate the effectiveness of the proposed method. This section analyzes the qualitative and quantitative performance of our approach by highlighting the strengths and weaknesses. Furthermore, it compares the presented results with prior state-of-the-art methods.

4.3.1. ICDAR-17

We follow the evaluation protocol elaborated in ICDAR-2017 POD [27]. We first calculate the number of true positives, false positives, and false negatives over the complete test set. We then compute the precision, recall, and f1-score as calculated in the prior methods [20,41]. Moreover, we also report the mAP score by evaluating the performance of our method on the test set. Following the criteria of the competition, we present results at IoU thresholds of 0.6 and 0.8. It is essential to emphasize that we have employed the recently published corrected version of the dataset [20]. Therefore, only the methods that have reported results on the corrected version of the dataset are directly comparable with our approach.
Table 3 presents the results achieved by our proposed end-to-end method with and without the pre-processing technique. At an IoU threshold of 0.6, we achieve a precision of 0.95, recall of 0.948, f1-score of 0.949, and mAP of 0.97 without the pre-processing method. The results further improve by roughly 0.004 on average after employing the proposed pre-processing. Upon increasing the IoU threshold to 0.8, our network reaches a precision of 0.914, recall of 0.912, f1-score of 0.913, and mAP of 0.949 without pre-processing, and adding the pre-processing again improves the results by roughly 0.004 on average. For completeness, we also compute the f1-score of the proposed method at IoU thresholds ranging from 0.5 to 1.0. Figure 7 illustrates the performance of our approach in terms of f1-score.
Figure 8 and Figure 9 depict the qualitative performance of our proposed system. Out of the 1929 isolated formulas present in the test set, our cascade formula detection network correctly predicted the region of 1836 formulas at an IoU threshold of 0.6. Moreover, it is worth mentioning that even at the higher IoU threshold of 0.8, the system identified correct boundaries for 1767 formulas in the test set. We also observe some rare false positive and false negative samples, which are exhibited in Figure 9.

Comparison with State-of-the-Art Methods

Table 3 shows that our hybrid cascade network, leveraging a deformable composite backbone with lightweight pre-processing, outperforms the prior state-of-the-art method [20] at the higher IoU threshold of 0.8 with an f1-score of 0.917, thus reducing the relative error by 7.8%. Furthermore, we achieve an almost identical f1-score at an IoU threshold of 0.6. It is essential to emphasize that the previous state-of-the-art work [20] depends on a heavy pre-processing pipeline consisting of a distance transform and connected component analysis (CCA) applied to grayscale images, whereas our generic data-driven method relies only on the lightweight dilation technique to produce better results.

4.3.2. Marmot

We follow the same evaluation criteria to report results on the Marmot dataset in order to draw a direct comparison with prior work. Our network detects isolated and embedded formulas separately because of the large size difference between the two classes. Table 4 summarizes the performance of our method on the Marmot dataset. As explained in Section 4.2.6, we calculate the accuracies of correct and partial detections. Our proposed mathematical formula identification system achieves a correct detection accuracy of 93% and 92.5% on isolated formulas with and without the pre-processing method, respectively. On embedded formulas, the system obtains a correct detection accuracy of 81.3% and 80.6% with and without the proposed dilation method, respectively.
Besides calculating detection accuracy, we compute AP at IoU thresholds of 0.5 and 0.75 for both isolated and embedded formula detection on the Marmot dataset. The achieved results are highlighted in Figure 10. Moreover, in Figure 11, we present the correct detection accuracy over IoU thresholds ranging from 0.5 to 1.0.
The qualitative performance of the presented method on the Marmot dataset is exhibited in Figure 12, Figure 13 and Figure 14. We predict correct regions for 236 out of 253 isolated formulas present in the test set. In the case of embedded formulas, the network precisely detects 777 of the 956 formulas in the test set.

Comparison with State-of-the-Art Methods

We compare our results with earlier approaches on the Marmot dataset in Table 4. From the table, it is evident that our cascade network with a deformable composite backbone clearly outperforms the prior state-of-the-art method [18] in detecting embedded formulas while accomplishing identical results on isolated formulas. We reduce the relative error by 30% by achieving a detection accuracy of 81.3% on embedded formulas. The partial detection accuracy is also worth mentioning: the proposed system partially detects 4.86% out of the remaining 7% of isolated formulas, which brings the total detection accuracy to 97.86%. For embedded formulas, we achieve a partial detection accuracy of 6.77%, which adds up to a total detection accuracy of 88.07%. Therefore, our network misses only 2.14% of isolated formulas and 11.93% of embedded formulas in the test set. The reduced number of missed detections for both isolated and embedded formulas demonstrates the superiority of the proposed method.

5. Conclusions and Future Work

We introduce an end-to-end trainable network for the detection of formulas in document images. Our proposed method follows the high-level architectural principles of traditional object detection approaches. Specifically, it feeds dilated document images into a Cascade Mask R-CNN equipped with a deformable composite dual backbone network. The proposed modifications help the network achieve better generalization and detection performance. We achieve state-of-the-art performance at a higher IoU threshold with an f1-score of 0.917 on the ICDAR-2017 POD dataset. Furthermore, we reduce the relative error by 30% in detecting embedded formulas on the Marmot dataset with a correct detection accuracy of 81.3%. Not only do we improve the quantitative accuracy, but we also observe a notable improvement in false-positive rates. Moreover, the presented work empirically establishes that a state-of-the-art formula detection system for scanned document images can be achieved without relying on heavy pre-processing pipelines.
For future work, we expect that a deeper backbone would be able to perform better in terms of both IoU and false positives. Moreover, the experiments can be extended to detect various graphical page objects such as figures, charts, titles, and headings in document images.

Author Contributions

Writing—original draft preparation, K.A.H.; writing—review and editing, K.A.H. and M.Z.A.; supervision, editing, and project administration, M.L., D.S. and A.P. All authors have read and agreed to the submitted version of the manuscript.

Funding

The work leading to this publication has been partially funded by the European project INFINITY under Grant Agreement ID 883293.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kieninger, T.; Dengel, A. The t-recs table recognition and analysis system. In Proceedings of the International Workshop on Document Analysis Systems, Nagano, Japan, 4–6 November 1998; Springer: Berlin/Heidelberg, Germany, 1998; pp. 255–270. [Google Scholar]
  2. Inoue, K.; Miyazaki, R.; Suzuki, M. Optical recognition of printed mathematical documents. Proc. Third Asian Technol. Conf. Math 1998, 3, 280–289. [Google Scholar]
  3. Hashmi, K.A.; Ponnappa, R.B.; Bukhari, S.S.; Jenckel, M.; Dengel, A. Feedback Learning: Automating the Process of Correcting and Completing the Extracted Information. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, Australia, 20–25 September 2019; Volume 5, pp. 116–121. [Google Scholar]
  4. Smith, R. An overview of the Tesseract OCR engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil, 23–26 September 2007; Volume 2, pp. 629–633. [Google Scholar]
  5. Ahmad, R.; Afzal, M.Z.; Rashid, S.F.; Liwicki, M.; Breuel, T. Scale and rotation invariant OCR for Pashto cursive script using MDLSTM network. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 1101–1105. [Google Scholar]
  6. Mokhtar, K.; Bukhari, S.S.; Dengel, A. OCR Error Correction: State-of-the-Art vs an NMT-based Approach. In Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; pp. 429–434. [Google Scholar]
  7. Mahdavi, M.; Zanibbi, R.; Mouchere, H.; Viard-Gaudin, C.; Garain, U. ICDAR 2019 CROHME+ TFD: Competition on recognition of handwritten mathematical expressions and typeset formula detection. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1533–1538. [Google Scholar]
  8. Hashmi, K.A.; Liwicki, M.; Stricker, D.; Afzal, M.A.; Afzal, M.A.; Afzal, M.Z. Current Status and Performance Analysis of Table Recognition in Document Images with Deep Neural Networks. IEEE Access 2021, 9, 87663–87685. [Google Scholar] [CrossRef]
  9. Fang, J.; Tao, X.; Tang, Z.; Qiu, R.; Liu, Y. Dataset, ground-truth and performance metrics for table detection evaluation. In Proceedings of the 2012 10th IAPR International Workshop on Document Analysis Systems, Gold Coast, QLD, Australia, 27–29 March 2012; pp. 445–449. [Google Scholar]
  10. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4974–4983. [Google Scholar]
  11. Phong, B.H.; Hoang, T.M.; Le, T.L. A hybrid method for mathematical expression detection in scientific document images. IEEE Access 2020, 8, 83663–83684. [Google Scholar] [CrossRef]
  12. Kacem, A.; Belaïd, A.; Ahmed, M.B. Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context. Int. J. Doc. Anal. Recognit. 2001, 4, 97–108. [Google Scholar] [CrossRef] [Green Version]
  13. Baker, J.B.; Sexton, A.P.; Sorge, V. Towards Reverse Engineering of PDF Documents. DML Towards Digit. Math. Libr. 2011, 4, 65–75. [Google Scholar]
  14. Jin, J.; Han, X.; Wang, Q. Mathematical Formulas Extraction; Icdar. Citeseer: Edinburgh, UK, 2003; pp. 1138–1141. [Google Scholar]
  15. Chang, T.Y.; Takiguchi, Y.; Okada, M. Physical structure segmentation with projection profile for mathematic formulae and graphics in academic paper images. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, State of Paraná, Brazil, 23–26 September 2007; Volume 2, pp. 1193–1197. [Google Scholar]
  16. Liu, Y.; Bai, K.; Gao, L. An efficient pre-processing method to identify logical components from pdf documents. In Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Berlin/Heidelberg, Germany, 2011; pp. 500–511. [Google Scholar]
  17. Ohyama, W.; Suzuki, M.; Uchida, S. Detecting mathematical expressions in scientific document images using a u-net trained on a diverse dataset. IEEE Access 2019, 7, 144030–144042. [Google Scholar] [CrossRef]
  18. Phong, B.H.; Dat, L.T.; Yen, N.T.; Hoang, T.M.; Le, T.L. A deep learning based system for mathematical expression detection and recognition in document images. In Proceedings of the 12th International Conference on Knowledge and Systems Engineering (KSE), Can Tho City, Vietnam, 12–14 November 2020; pp. 85–90. [Google Scholar]
  19. Mali, P.; Kukkadapu, P.; Mahdavi, M.; Zanibbi, R. ScanSSD: Scanning Single Shot Detector for Mathematical Formulas in PDF Document Images. arXiv 2020, arXiv:2003.08005. [Google Scholar]
  20. Younas, J.; Siddiqui, S.A.; Munir, M.; Malik, M.I.; Shafait, F.; Lukowicz, P.; Ahmed, S. Fi-Fo Detector: Figure and Formula Detection Using Deformable Networks. Appl. Sci. 2020, 10, 6460. [Google Scholar] [CrossRef]
  21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar]
  22. Huang, Y.; Yan, Q.; Li, Y.; Chen, Y.; Wang, X.; Gao, L.; Tang, Z. A YOLO-based table detection method. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 813–818. [Google Scholar]
  23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  24. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. arXiv 2017, arXiv:1612.03144. [Google Scholar]
  25. Agarwal, M.; Mondal, A.; Jawahar, C. CDeC-Net: Composite Deformable Cascade Network for Table Detection in Document Images. arXiv 2020, arXiv:2008.10831. [Google Scholar]
  26. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  27. Gao, L.; Yi, X.; Jiang, Z.; Hao, L.; Tang, Z. ICDAR2017 competition on page object detection. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 10–15 November 2017; Volume 1, pp. 1417–1422. [Google Scholar]
  28. Lin, X.; Gao, L.; Tang, Z.; Baker, J.; Sorge, V. Mathematical formula identification and performance evaluation in PDF documents. Int. J. Doc. Anal. Recognit. 2014, 17, 239–255. [Google Scholar] [CrossRef]
  29. Fateman, R.J.; Tokuyasu, T.; Berman, B.P.; Mitchell, N. Optical character recognition and parsing of typeset mathematics. J. Vis. Commun. Image Represent. 1996, 7, 2–15. [Google Scholar] [CrossRef] [Green Version]
  30. Lee, H.J.; Wang, J.S. Design of a mathematical expression understanding system. Pattern Recognit. Lett. 1997, 18, 289–298. [Google Scholar] [CrossRef]
  31. Toumit, J.Y.; Garcia-Salicetti, S.; Emptoz, H. A hierarchical and recursive model of mathematical expressions for automatic reading of mathematical documents. In Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR’99 (Cat. No. PR00318), Bangalore, India, 20–22 September 1999; pp. 119–122. [Google Scholar]
  32. Garain, U.; Chaudhuri, B. A syntactic approach for processing mathematical expressions in printed documents. In Proceedings of the 15th International Conference on Pattern Recognition. ICPR-2000, Barcelona, Spain, 3–7 September 2000; Volume 4, pp. 523–526. [Google Scholar]
  33. Chowdhury, S.; Mandal, S.; Das, A.K.; Chanda, B. Automated segmentation of math-zones from document images. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, Edinburgh, UK, 3–6 August 2003; Citeseer: Princeton, NJ, USA, 2003; pp. 755–759. [Google Scholar]
  34. Drake, D.M.; Baird, H.S. Distinguishing mathematics notation from English text using computational geometry. In Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR’05), Seoul, Korea, 29 August–1 September 2005; pp. 1270–1274. [Google Scholar]
  35. Bhatt, J.; Hashmi, K.A.; Afzal, M.Z.; Stricker, D. A Survey of Graphical Page Object Detection with Deep Neural Networks. Appl. Sci. 2021, 11, 5344. [Google Scholar] [CrossRef]
  36. He, W.; Luo, Y.; Yin, F.; Hu, H.; Han, J.; Ding, E.; Liu, C.L. Context-aware mathematical expression recognition: An end-to-end framework and a benchmark. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3246–3251. [Google Scholar]
  37. Gao, L.; Yi, X.; Liao, Y.; Jiang, Z.; Yan, Z.; Tang, Z. A deep learning-based formula detection method for PDF documents. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 10–15 November 2017; Volume 1, pp. 553–558. [Google Scholar]
  38. Yi, X.; Gao, L.; Liao, Y.; Zhang, X.; Liu, R.; Jiang, Z. CNN based page object detection in document images. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 10–15 November 2017; Volume 1, pp. 230–235. [Google Scholar]
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  40. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  41. Li, X.H.; Yin, F.; Liu, C.L. Page object detection from pdf document images by deep structured prediction and supervised clustering. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 3627–3632. [Google Scholar]
  42. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. arXiv 2017, arXiv:1703.06211. [Google Scholar]
  43. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  44. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Networks Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [Green Version]
  45. Liu, Y.; Wang, Y.; Wang, S.; Liang, T.; Zhao, Q.; Tang, Z.; Ling, H. Cbnet: A novel composite backbone network architecture for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11653–11660. [Google Scholar]
  46. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. arXiv 2016, arXiv:1611.05431. [Google Scholar]
  47. Yosinski, J.; Clune, J.; Nguyen, A.; Fuchs, T.; Lipson, H. Understanding neural networks through deep visualization. arXiv 2015, arXiv:1506.06579. [Google Scholar]
  48. Siddiqui, S.A.; Malik, M.I.; Agne, S.; Dengel, A.; Ahmed, S. Decnt: Deep deformable cnn for table detection. IEEE Access 2018, 6, 74151–74161. [Google Scholar] [CrossRef]
  49. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 9308–9316. [Google Scholar]
  50. Lin, X.; Gao, L.; Tang, Z.; Lin, X.; Hu, X. Performance evaluation of mathematical formula identification. In Proceedings of the 10th IAPR International Workshop on Document Analysis Systems, Queensland, Australia, 27–29 March 2012; pp. 287–291. [Google Scholar]
  51. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
  52. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
  53. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar]
  54. Blaschko, M.B.; Lampert, C.H. Learning to localize objects with structured output regression. In Proceedings of the European Conference on Computer Vision, Marseille, France, 12–16 October 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 2–15. [Google Scholar]
  55. Chu, W.T.; Liu, F. Mathematical formula detection in heterogeneous document images. In Proceedings of the 2013 Conference on Technologies and Applications of Artificial Intelligence, Taipei, Taiwan, 6–8 December 2013; pp. 140–145. [Google Scholar]
Figure 1. Visual depiction of the need to apply formula detection before extracting information in document images. We apply open source Tesseract-OCR [4] on a document image taken from Marmot dataset [9] containing mathematical formulas as illustrated in (a). Besides the textual content, the OCR system fails miserably in recognizing information from formulas as depicted in (b).
Figure 2. Instances of isolated and embedded formulas in sample document images. The green boundaries represent the ground truth regions. Separate images are used for the convenience of the readers. The isolated formulas highlighted in (a), spanning multiple lines, are prone to be misclassified with tables, whereas the embedded formulas depicted in (b) are confused with the regular text.
Figure 3. The presented framework is based on Cascade Mask R-CNN equipped with a deformable composite backbone applied on dilated document images. Modules B, C, and D represent bounding box, classification, and segmentation, respectively.
Figure 4. Visual explanation of the employed backbone (CBNet) in our framework. We utilize a dual ResNeXt-101 backbone, in which there are composite connections between parallel stages of the adjacent assistant and lead backbone. Moreover, we replace the conventional convolutions in ResNeXt101 with deformable convolution.
Figure 5. This figure illustrates a visual demonstration of a deformable convolution. The green grid depicts the conventional 3 × 3 convolutional operation, whereas the blue boxes highlight the effective receptive field of a similar 3 × 3 deformable convolution.
Figure 6. Visual comparison of a document image before (shown in (a)) and after the pre-processing method (depicted in (b)). The dilation process facilitates our feature extraction network by increasing the boundaries of foreground pixels, reducing the number of background pixels.
Figure 7. Performance evaluation in terms of f1-score over the varying IoU thresholds ranging from 0.5 to 1.0 on the ICDAR-2017-POD dataset.
Figure 8. Performance evaluation of the proposed method on the ICDAR-2017-POD dataset. The green colour depicts the ground truth, while red denotes the predicted bounding boxes. (a,b) exhibit true positives on a two-column and a single column document image, respectively.
Figure 9. Performance evaluation of the proposed method on the ICDAR-2017-POD dataset. The green colour depicts the ground truth, while red denotes the predicted bounding boxes. (a) represents an example of both true positives and a single false positive case, whereas (b) shows three false positives with one false negative.
Figure 10. Performance evaluation of the proposed method on the Marmot dataset in terms of average precision (AP) at IoU thresholds of 0.5 and 0.75. (a) represents the evolution of AP on isolated formulas, whereas (b) exhibits the evolution of AP on embedded formulas.
Figure 11. Correct detection accuracy achieved on varying IoU threshold ranging from 0.5 to 1.0 on the Marmot dataset. (a) depicts the correct detection accuracy on isolated formulas, whereas (b) demonstrates the correct detection accuracy on embedded formulas.
Figure 12. Instances of correct and partial detection of isolated formulas on the Marmot dataset. The green color represents the correct detections, whereas the partial and missed detections are highlighted with red and blue colors, respectively. (a) depicts a couple of samples of correct detection in which an IoU score between ground truth and predicted region is greater than or equal to 0.5, whereas (b) illustrates a few cases of partial and missed detection.
Figure 13. Instances of correct detections of embedded formulas in a document image taken from the Marmot dataset. The green color highlights the ground truth, whereas the predictions are marked with red color.
Figure 14. Example of partial and missed detections of embedded formulas in a document image taken from the Marmot dataset. While green color highlights the correct predictions, partial and missed detections are marked with red and blue colors, respectively.
Table 1. Architectural details of the employed dual composite ResNeXt-101 backbone network. DCN represents the incorporation of deformable convolution.
| Stage | Output | DCN | ResNeXt-101 (32 × 4d) |
| conv1 | 112 × 112 | – | 7 × 7, 64, stride 2 |
| conv2 | 56 × 56 | – | 3 × 3 max pooling, stride 2; [1 × 1, 128; 3 × 3, 128, C = 32; 1 × 1, 256] × 3 |
| conv3 | 28 × 28 | ✓ | [1 × 1, 256; 3 × 3, 256, C = 32; 1 × 1, 512] × 4 |
| conv4 | 14 × 14 | ✓ | [1 × 1, 512; 3 × 3, 512, C = 32; 1 × 1, 1024] × 23 |
| conv5 | 7 × 7 | ✓ | [1 × 1, 1024; 3 × 3, 1024, C = 32; 1 × 1, 2048] × 3 |
|  | 1 × 1 | – | global average pool, 1000-d fc, softmax |
Table 2. Summary of the main statistics of the employed datasets.
| Dataset | ICDAR-17 |  | Marmot |  |
|  | Train | Test | Train | Test |
| Number of Images | 1600 | 817 | 330 | 70 |
| Number of Isolated Formulas | 3534 | 1929 | 1322 | 253 |
| Number of Embedded Formulas | – | – | 6951 | 956 |
Table 3. Quantitative analysis of the presented work with existing state-of-the-art methods on the ICDAR-2017 POD dataset. † represents the results that are not directly comparable with our method because they are not evaluated on the revised version of the dataset.
ICDAR-2017 POD
| Method | IoU = 0.6 |  |  |  | IoU = 0.8 |  |  |  |
|  | Precision | Recall | F1-Score | AP | Precision | Recall | F1-Score | AP |
| NLPR-PAL [27] † | 0.901 | 0.929 | 0.915 | 0.839 | 0.888 | 0.916 | 0.902 | 0.816 |
| Li et al. [41] † | 0.935 | 0.331 | 0.489 | 0.312 | 0.877 | 0.310 | 0.459 | 0.274 |
| Fi-Fo Detector, Non-Deformable [20] | 0.910 | 0.927 | 0.918 | 0.953 | 0.860 | 0.877 | 0.868 | 0.928 |
| Fi-Fo Detector, Deformable [20] | 0.957 | 0.952 | 0.954 | 0.949 | 0.913 | 0.908 | 0.910 | 0.898 |
| Ours (Without Pre-Processing) | 0.950 | 0.948 | 0.949 | 0.970 | 0.914 | 0.912 | 0.913 | 0.949 |
| Ours (Complete Method) | 0.954 | 0.952 | 0.953 | 0.975 | 0.918 | 0.916 | 0.917 | 0.954 |
Table 4. Performance comparison between our method and previous state-of-the-art approaches on the Marmot dataset.
| Method | Formula | Correct (%) | Partial (%) | Total (%) |
| Chu et al. [55] | Isolated | 26.87 | 44.87 | 71.76 |
|  | Embedded | 1.74 | 28.87 | 30.61 |
| Phong et al. [11] | Isolated | 50.37 | 39.14 | 91.18 |
|  | Embedded | 22.9 | 58.45 | 81.35 |
| Phong et al. [18] | Isolated | 93 | – | – |
|  | Embedded | 73 | – | – |
| Ours (Without Pre-Processing) | Isolated | 92.5 | 4.64 | 97.14 |
|  | Embedded | 80.6 | 6.23 | 86.83 |
| Ours (Complete) | Isolated | 93 | 4.86 | 97.86 |
|  | Embedded | 81.3 | 6.77 | 88.07 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
