Cascade Network with Deformable Composite Backbone for Formula Detection in Scanned Document Images

Abstract: This paper presents a novel architecture for detecting mathematical formulas in document images, which is an important step for reliable information extraction in several domains. Recently, Cascade Mask R-CNN networks have been introduced to solve object detection in computer vision. In this paper, we suggest a couple of modifications to the existing Cascade Mask R-CNN architecture: First, the proposed network uses deformable convolutions instead of conventional convolutions in the backbone network to spot areas of interest better. Second, it uses a dual backbone of ResNeXt-101, having composite connections at the parallel stages. Finally, our proposed network is end-to-end trainable. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets. The proposed approach demonstrates state-of-the-art performance on ICDAR-2017 POD at a higher IoU threshold with an F1-score of 0.917, reducing the relative error by 7.8%. Moreover, we achieve a correct detection accuracy of 81.3% on embedded formulas on the Marmot dataset, which results in a relative error reduction of 30%.


Introduction
Information extraction from document images is a primary need in various domains such as banking, archiving, and academia and industry in general. Research in document analysis has been trying to develop precise information extraction systems for several years [1][2][3][4]. Although state-of-the-art Optical Character Recognition (OCR) systems [5,6] recognize regular text with high accuracy, they struggle to recognize information from page objects (tables, figures, mathematical formulas) in document images [7,8]. Figure 1 illustrates the problem: an open-source OCR, Tesseract [4] (we use the latest, LSTM-based version 4.1.1, available at https://github.com/tesseract-ocr/tesseract), is applied to extract the content from a document image. Besides recognizing the textual content, the OCR fails to extract the information from mathematical formulas. This shows that formula detection is a crucial preliminary step for information extraction in such document images.

Mathematical formulas are an integral part of documents because they allow us to represent complex information concisely by exploiting the capabilities of mathematics. Formulas present in documents are categorized into isolated formulas (set on a separate line) and embedded formulas (inline mathematical expressions). Figure 2 exhibits the problem of detecting isolated and embedded formulas in document images.

The task of detecting both isolated and embedded formulas in document images is a difficult problem because of the underlying low inter-class and high intra-class variance [9]. The hurdles involved in detecting isolated and embedded formulas are exhibited in Figure 2. The isolated formulas present in a document image can easily be misclassified as other page objects due to their low inter-class variance with tables, algorithms, and figures. The embedded formulas contain mathematical functions (log, exp, tan), operators (×, +, σ, %), and variables (i, j, k). These inline expressions are prone to being misinterpreted as regular text in a document image [10].

Figure 2. Instances of isolated and embedded formulas in sample document images. The green boundaries represent the ground-truth regions. Separate images are used for the convenience of the readers. The isolated formulas spanning multiple lines are prone to being misclassified as tables, whereas the embedded formulas are confused with the regular text.
Previous works employed hand-crafted features to detect formulas in documents [2,11,12]. Although these systems extract mathematical formulas, they fail to obtain effective results on generic datasets. Later, statistical learning, mainly machine learning-based methods, advanced the performance of formula identification systems [13][14][15]. The recent success of deep learning-based methods in computer vision within the last decade also had an impact on the task of formula detection in scanned document images. Several deep learning-based formula detection approaches [16][17][18][19] have been presented in the past two years. They are mainly equipped with object detection algorithms such as Faster R-CNN [20], YOLO [21], SSD [22], and FPNs [23]. In recent work, Agarwal et al. [24] [...]

[...] The detection of graphical page objects in documents is a well-studied problem [28]. Noticeable progress has been achieved in this domain by approaches ranging from custom heuristics to deep learning-based methods.

Another graphical page object detection system was published by Li et al. [41]. [...]

We employ a robust and novel dual backbone architecture to extract rich spatial features for detecting formulas in document images. The performance of any object detection algorithm depends on the quality of the feature map it receives from the feature extraction network [44]. In this paper, we implement a dual backbone-based network [45] in which the first backbone is the assistant backbone and the other is the lead backbone. For a conventional convolutional network with a single backbone, the output of the (l − 1)-th stage is propagated as the input of the l-th stage, which is given by:

x^l = F^l(x^{l−1})    (1)

where F^l represents the non-linear function at the l-th level. Contrary to this, our lead backbone receives input from its prior stage and from the parallel stage of the assistant backbone. Therefore, the input of the lead backbone b_l at stage l is a combination of the output of the lead backbone at stage l − 1 and the output of the parallel l-th stage of the assistant backbone b_a. Mathematically, it is explained in [45] as:

x_b^l = F_b^l(x_b^{l−1} + g(x_a^l))    (2)

where g defines the composite connection between the lead and the assistant backbone. These composite connections enable the lead backbone to extract essential spatial features.

As shown in Figure 3, we propagate the output of the final lead backbone stage to the region proposal network of our Cascade Mask R-CNN.
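To make the composite connection concrete, below is a minimal PyTorch sketch of the idea under simplifying assumptions: the stages are toy convolutional blocks that preserve resolution and channel count, and g is a 1 × 1 convolution. The class and function names are illustrative placeholders, not our actual implementation, which uses ResNeXt-101 stages following CBNet [45].

```python
import torch
import torch.nn as nn


def make_stage(channels: int) -> nn.Module:
    # Toy stage: keeps resolution and channel count fixed so that the
    # composite connection needs no resizing. A real ResNeXt-101 stage
    # downsamples, so g must also resize features, as CBNet [45] does.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )


class DualBackbone(nn.Module):
    """Assistant + lead backbone joined by composite connections g."""

    def __init__(self, channels: int = 64, num_stages: int = 4):
        super().__init__()
        self.assistant = nn.ModuleList(make_stage(channels) for _ in range(num_stages))
        self.lead = nn.ModuleList(make_stage(channels) for _ in range(num_stages))
        # g(.) from Equation (2): one 1x1 projection per parallel stage.
        self.g = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_stages)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assistant backbone: a plain feed-forward pass; keep each stage output.
        a_feats, a = [], x
        for stage in self.assistant:
            a = stage(a)
            a_feats.append(a)
        # Lead backbone: the input of stage l combines the lead output of
        # stage l-1 with the assistant's parallel stage l via g (Equation (2)).
        b = x
        for stage, g, a_feat in zip(self.lead, self.g, a_feats):
            b = stage(b + g(a_feat))
        return b  # the final lead feature map, fed to the RPN


# feats = DualBackbone()(torch.randn(1, 64, 96, 96))  # -> (1, 64, 96, 96)
```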

We incorporate deformable convolution filters [42] instead of the conventional convolutions that exist in the ResNeXt-101 architecture [46]. Convolutional neural networks extract the important spatial features that are essential to perform the required task. Based on their position in the hierarchy, convolutional layers discover different features [47].

Convolutional layers at the bottom search for crude features such as sharp edges or gradients, whereas layers at higher levels look for abstract components such as complete objects [48]. The conventional convolution operation has the same effective receptive field for all the neurons. The 2D convolution comprises two parts: sampling the input feature map x over a regular grid R, and summing the sampled values weighted by w. For each location p_0 on the output feature map y:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n)    (3)

In the case of deformable convolution, an additional offset represented as Δp_n is added, which deforms the filter's receptive field by augmenting the predefined sampling offsets.

Hence, Equation (3), as explained in [42], is transformed into:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)    (4)

This modification makes the sampling process irregular, with an offset value of p_n + Δp_n. As these offsets are differentiable and fractional, bilinear interpolation is used to implement them. Considering p = p_0 + p_n + Δp_n, the bilinear interpolation is implemented as follows:

x(p) = Σ_q G(q, p) · x(q)    (5)

where q iterates over all the possible locations on the input feature map x and the symbol G represents the bilinear interpolation kernel. It is vital to mention that G is a two-dimensional kernel that can be further divided into two one-dimensional kernels. It is mathematically explained as:

G(q, p) = g(q_x, p_x) × g(q_y, p_y)    (6)

where g is defined as g(a, b) = max(0, 1 − |a − b|). It is important to note that Equation (5) is fast to compute since G(q, p) is zero for most values of q. We refer our readers to [42,49] for a detailed explanation of deformable convolutions.
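The following is a minimal sketch of this mechanism using torchvision's DeformConv2d; it is not the authors' exact layer (they replace the convolutions inside ResNeXt-101), but it shows how the learned offsets Δp_n of Equation (4) are produced by a plain convolution and consumed by the deformable one.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableConvBlock(nn.Module):
    """3x3 deformable convolution: a plain conv predicts the offsets
    (Delta p_n in Equation (4), one (dx, dy) pair per kernel position),
    and DeformConv2d samples the input at the shifted, fractional
    locations via the bilinear interpolation of Equations (5)-(6)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 3x3 kernel -> 9 sampling points -> 2 * 9 = 18 offset channels.
        self.offset_conv = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        # Zero-init the offsets so training starts as a regular convolution.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)  # learned Delta p_n at every location
        return self.deform_conv(x, offsets)


# y = DeformableConvBlock(64, 64)(torch.randn(1, 64, 56, 56))  # -> (1, 64, 56, 56)
```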

The dilation transformation is used to thicken the black regions in the input image. Since this transformation works on binary images, the input images are binarized first.

Black pixels represent the characters, and white pixels describe the background in the binarized images. Therefore, this transformation thickens the characters. Figure 6 depicts the output of the dilation transformation on one of the sample images. We use a structuring element of size 2 × 2. We tried different sizes of structuring elements.

However, 2 × 2 produces the optimal results.

Figure 6. Example of a document image before (left: original image) and after (right: after dilation) our pre-processing method. The dilation process facilitates our feature extraction network by increasing the boundaries of foreground pixels, which results in reducing the number of background pixels.
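A small OpenCV sketch of this pre-processing step follows. Note that the excerpt does not state which binarization method is used, so Otsu thresholding is an assumption here; the inversion steps are needed because cv2.dilate grows white regions, whereas the characters are black.

```python
import cv2
import numpy as np


def thicken_characters(image_path: str) -> np.ndarray:
    """Binarize a document image and thicken its (black) characters
    with a 2x2 structuring element, as in our pre-processing step."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarize: characters -> 0 (black), background -> 255 (white).
    # Otsu thresholding is an assumed choice of binarization method.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # cv2.dilate grows *white* regions, so invert first (text -> white),
    # dilate, then invert back so the characters are black again.
    inverted = cv2.bitwise_not(binary)
    kernel = np.ones((2, 2), np.uint8)  # the 2x2 structuring element
    dilated = cv2.dilate(inverted, kernel, iterations=1)
    return cv2.bitwise_not(dilated)


# thick = thicken_characters("page.png")
```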

We implement the proposed method in PyTorch by leveraging the MMDetection object detection pipeline [51]. Our composite backbone ResNeXt-101 [46] is pre-trained on the MS-COCO dataset [52]. The pre-trained feature extraction network helps our object detection algorithm adapt from the domain of natural scenes to documents. We scale the input document images to 1200 × 800 while maintaining the original aspect ratio. We follow the evaluation protocol of the ICDAR-2017 POD competition [26]. For the Marmot dataset, we follow the identical criteria for computing detection accuracy as explained in [10] to allow direct comparisons.
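As a rough illustration of how these choices translate into an MMDetection (v2.x-style) config, the fragment below enables deformable convolutions inside a ResNeXt-101 backbone and the aspect-ratio-preserving 1200 × 800 resize. The backbone type name, the 64x4d cardinality, and the exact pipeline entries are assumptions; stock MMDetection does not ship a composite dual backbone, so that part requires a custom module.

```python
# Hedged MMDetection-style config fragment (v2.x syntax). 'CBResNeXt' is
# a placeholder name for a custom CBNet-style dual backbone wrapper.
model = dict(
    type='CascadeRCNN',
    backbone=dict(
        type='CBResNeXt',  # placeholder: custom dual ResNeXt-101 backbone
        depth=101,
        groups=64,         # 64x4d variant is an assumption
        base_width=4,
        # Replace the regular convolutions with deformable ones (DCN).
        dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True),
    ),
)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    # Scale to 1200 x 800; keep_ratio preserves the original aspect ratio.
    dict(type='Resize', img_scale=(1200, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
]
```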

We report results by employing the following metrics. The F1-score [53] is computed by taking the harmonic mean of precision and recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

The mean average precision, also referred to as the mAP score, is calculated by averaging the maximum precision over various recall thresholds. Mathematically, it is explained in [52] as follows:

mAP = (1 / |R|) · Σ_{r ∈ R} AP_r

where AP_r is the average precision at recall level r. The Intersection over Union (IoU) metric [54] estimates the amount of the predicted region intersecting with the ground-truth region. It is explained as follows:

IoU = Area(B_pred ∩ B_gt) / Area(B_pred ∪ B_gt)

We report results on the Marmot dataset using the metric of detection accuracy. As explained in [10], we classify each prediction as correct or partially correct based on its IoU value (see the sketch after this list):

1. Correct: the predicted bounding box is considered correct when the IoU score between the predicted formula region and the ground truth is equal to or greater than 0.5.

2. Partial: when the IoU score between the inferred and the ground-truth formula region is in the interval (0, 0.5), the detection is categorized as partial.

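The following is a small sketch of how this rule can be applied, assuming axis-aligned boxes in [x1, y1, x2, y2] format (the box format and helper names are illustrative, not part of the evaluation code of [10]):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_detection(pred, gt):
    """Apply the correct / partial rule from [10]; detections with no
    overlap at all fall into neither category."""
    iou = box_iou(pred, gt)
    if iou >= 0.5:
        return "correct"
    if iou > 0.0:
        return "partial"
    return "miss"


# classify_detection([10, 10, 60, 30], [12, 11, 58, 29])  # -> 'correct'
```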
We report results on the datasets of ICDAR-2017 POD [26] and Marmot [27] to evaluate the proposed approach. We follow the evaluation protocol as elaborated in the ICDAR-2017 POD competition [26], and we compare against the prior methods [19,41]. Moreover, we also report the mAP score by evaluating the performance of our method on the test set. Following the criteria of the competition, we present results at the IoU thresholds of 0.6 and 0.8. It is essential to emphasize that we have employed the recently published corrected version of the dataset [19]. Therefore, only the methods that have reported results on the corrected version of the dataset are directly comparable with our approach.
Table 2 presents the results achieved by our proposed end-to-end method.

We compare our results with earlier approaches on the Marmot dataset in Table 3.

We introduce an end-to-end trainable network for the detection of formulas in scanned document images. [...] false-positive rates. Moreover, the presented work empirically establishes that, without relying on heavy pre-processing pipelines, it is possible to achieve a state-of-the-art formula detection system for scanned document images.

For future work, we expect that a deeper backbone would be able to perform better.