A Survey of Graphical Page Object Detection with Deep Neural Networks

Abstract: In any document, graphical elements like tables, figures, and formulas contain essential information. The processing and interpretation of such information require specialized algorithms, because off-the-shelf OCR components cannot process it reliably. Therefore, an essential step in document analysis pipelines is to detect these graphical components. This step leads to a high-level conceptual understanding of the documents that makes their digitization viable. Since the advent of deep learning, the performance of deep learning-based object detection has improved manyfold. This work outlines and summarizes the deep learning approaches for detecting graphical page objects in document images. We discuss the most relevant deep learning-based approaches and the state of the art in graphical page object detection in document images. This work provides a comprehensive understanding of the current state of the art and related challenges. Furthermore, we discuss leading datasets along with the quantitative evaluation. Moreover, we briefly discuss promising directions that can be utilized for further improvements.


Introduction
The rapid increase in the digitization of document images in both financial and non-financial sectors has considerably improved the accessibility of data. Given the volume of these scanned document images, options like manual data capture have become highly laborious and impractical for obtaining reliable information. Therefore, over the last few decades, accurate information extraction has been a vital research topic for the document analysis community [1][2][3][4].
Apart from the text, information in scanned documents is often stored in graphical form, such as tables, formulas, and figures. These are referred to as graphical page objects in the document analysis community [5]. Figure 1 illustrates the problem, which involves the detection of figures, formulas, and tables in document images. It is imperative to detect the graphical page objects before applying optical character recognition (OCR). One such scenario is information extraction from document images. Figure 2 illustrates the necessity of applying graphical page object detection systems for information extraction in document images. It is evident that even a state-of-the-art OCR method [6] fails to extract precise information from figures, tables, and formulas. Another application of such page object detection methods is document retrieval systems [7,8], where a document image having a specific type of page object is required. Therefore, it is essential to develop approaches that can parse the information from these page objects. With the recent surge of deep learning-based object detection algorithms in computer vision [10][11][12], a considerable number of methods have been developed that formulate the detection of graphical page objects in document images as an object detection problem. Furthermore, several datasets consisting of thousands of annotated scanned document images have also been published. Although the approaches leveraging these datasets have significantly improved the state of the art, a consolidated comparison among these approaches is missing.
Figure 2. The document images are taken from [5], whereas the extracted information is presented on the right side. We have applied the open-source Tesseract OCR [6] to extract the information. Since the OCR correctly recognized the textual content, we only demonstrate the extracted information from graphical page objects for brevity. The incorrectly extracted content shows that graphical page object detection is an essential preliminary step before information extraction.
In this survey paper, we present a thorough analysis of the recent state-of-the-art approaches that tackle the problem of graphical page object detection in scanned document images by employing deep neural networks. Since page objects can be of several types [13], we have covered the three most important page objects in document images [9]. These graphical page objects are referred to as tables, figures, and formulas. This paper investigates how deep neural network-based approaches work on detecting these types of page objects. Therefore, we have covered the most relevant approaches that have produced state-of-the-art results in this domain. Some of the discussed approaches work only on a single page object, and some cover all three of them. However, our primary focus is to provide a perspective on the outcome of deep learning-based approaches to graphical page object detection in document images. To summarize, our contributions are as follows:

1. We present comparisons between recently introduced algorithms for improving page object detection by highlighting their advantages and limitations.
2. We present a brief overview of the publicly available challenging datasets for graphical page object detection.
3. We provide an evaluative comparison among the state-of-the-art graphical page object detection systems.
Figure 3 illustrates the complete flow of this survey paper, whereas the remainder of the paper is organized as follows: Section 2 presents a brief overview of the prior works that have exploited traditional approaches to detect graphical page objects. Section 3 explains the approaches that detect graphical page objects by leveraging deep learning methods. Section 4 highlights the publicly available datasets that can be employed to tackle the mentioned problem. Section 5 explains the most commonly employed evaluation metrics and analyzes the performance of the approaches discussed in Section 3. Section 6 concludes the paper with a discussion of the current challenges and highlights future directions.

Traditional Approaches
Graphical page object detection in documents is a well-recognized problem, and several approaches that employ traditional methods have been introduced in this domain. Figure 4 illustrates the fundamental differences between the traditional approaches and the deep learning-based approaches. The traditional approaches leverage image processing techniques such as binarization and connected component analysis. In contrast, deep learning-based methods utilize a backbone CNN to generate spatial feature maps from the document images.
Figure 4. Visual depiction of the basic differences between the traditional methods and the deep learning-based techniques. Traditional approaches rely heavily on image processing methods and custom heuristics, whereas deep learning techniques leverage convolutional neural network-based architectures. In deep learning approaches, spatial features are extracted from the document images by backbone networks such as VGG-16 [14] or ResNet [15]. These features are further propagated to region detection or segmentation networks to classify and localize page objects.
In order to implement table detection, the prior techniques [16][17][18] have defined a certain underlying structure for tables in a document. Tupaj et al. [19] employed Optical Character Recognition (OCR) to extract tabular information. The method tried to recognize possible table areas by analyzing the keywords and white spaces. The main disadvantage of this approach is that it is fully based on the presumptions regarding the tables' structure and the collection of the used keywords.
Wang et al. [20] proposed another approach in the field of table analysis. It utilizes the distance between consecutive words to detect table lines. Subsequently, adjacent vertical lines are grouped with consecutive horizontal words to propose table entity candidates. However, the underlying assumption is that there can be a maximum of two columns in a table. Hence, three types of layouts (single, double, and mixed columns) are designed in this approach. The drawback of this method is that it is only applicable to a limited number of designed templates.
Kieninger et al. [21][22][23] introduced a system called T-Recs to extract tabular information from documents. Their method takes the word bounding boxes which are segregated to build a segmentation graph in a bottom-up manner. Their system is vulnerable to tables containing multi rows and columns.
A method for detecting tables by calculating the intersection area between the vertical and horizontal lines was suggested by Gatos et al. [24]. The recreation of tables is then done by denoting corresponding vertical and horizontal lines related to intersection pairs. This approach presumes that a table should have ruling lines. A method for table detection by using Hidden Markov Models (HMMs) was suggested by Costa e Silva et al. [25]. The method fetches text from PDF files by applying the pdftotext Linux utility. Then feature vectors are computed based upon the gaps present between the text. This approach can only be employed for non-raster PDF files that do not contain noisy data.
Hu et al. [26] proposed a method for table detection under the assumption that tables in documents contain only a single column. Another technique for table detection in heterogeneous documents was proposed by Shafait et al. [18]. This mechanism is built into the open-source Tesseract OCR engine [6]. Although these traditional approaches were effective on documents with restricted layout variations, they either rely on metadata or depend heavily on post-processing methods involving custom heuristics. Furthermore, they fail to produce similar results on generic datasets. Therefore, it is essential to exploit recently proposed deep learning techniques to tackle the problem of graphical page object detection in document images.

Methodologies
Graphical objects like tables, figures, and formulas are an integral part of documents because they hold a significant amount of information in a confined space. As explained in Section 1, detecting the graphical object means localizing these objects within a document image. Conceptually this problem is identical to localizing the objects in natural scene images. Recently, deep learning algorithms have also attracted the interest of researchers in the document image analysis community.
This section will discuss the methodologies that have utilized the capabilities of deep neural networks to solve the problem of graphical page object detection in document images. By following the convention of [5], we have covered approaches that have worked on the detection of the following graphical page objects in document images: (1) Tables, (2) Figures, and (3) Formulas.
For the convenience of our readers, we have classified the methodologies according to the employed deep learning concepts. We discuss the organizational flow of the methodologies in Figure 5. Table 1 summarizes the presented approaches and highlights their advantages and limitations.
Table 1. Summary of the presented approaches with their advantages and limitations.
Advantages: The dynamic receptive field of deformable CNN helps recognize tabular boundaries having arbitrary layouts. Limitations: Deformable CNN requires more computation than conventional CNN.
Fi-Fo Detector [28]. Method: Color transform, connected component analysis, and distance transform are applied to images that are fed to a deformable pyramid network. Advantages: (a) Transformed images yield better results than raw input images. (b) The approach leverages the deformable FPN model in its object detection network. Limitations: The approach depends on extra pre-processing steps.
Advantages: Straightforward and effective approach to detect tables. Limitations: Does not perform as accurately as other state-of-the-art methods.
Advantages: Leverages both selective search and a region proposal network to generate reliable regions of interest. Limitations: Computationally expensive because of the combination of two separate object detection networks.
Advantages: Simple end-to-end approach to detect multiple page objects. Limitations: The network often misclassifies similar-looking page objects belonging to different classes.
Advantages: The distance transform method helps the object detection network focus on the desired page object. Limitations: Requires an extra pre-processing method.
Advantages: (a) Multi-scale feature pyramid network. (b) Deformable convolution improves the performance. Limitations: The method requires high computational resources due to the composite backbone and deformable convolutions.
Kavasidis et al. [35]. Method: Semantic image segmentation with saliency detection. Limitations: The approach depends on multiple pre-processing steps to achieve good results.
Advantages: Replacing non-maximum suppression with a dynamic programming algorithm improves the refinement of regions of interest. Limitations: Extra post-processing overhead.

Faster R-CNN
Improvements in object detection algorithms in computer vision have translated directly into improvements in graphical page object detection in document images. Faster R-CNN [11], the improved version of Fast R-CNN [31], is a two-stage object detection network. Figure 6 illustrates the architecture of Faster R-CNN; for a detailed explanation of the architecture, readers may refer to [11]. This section covers the approaches that detect graphical page objects by exploiting the capabilities of Faster R-CNN [11].
Schreiber et al. [29] suggested an image-based deep learning approach in which Faster R-CNN is applied to detect tables in document images. The paper shows that recently introduced object detectors based on Convolutional Neural Networks (CNNs) can detect tables in document images. By leveraging backbones like ZFNet [37] and VGG-16 [14], the authors achieved promising results on the ICDAR-13 dataset [38]. Their approach also applies transfer learning by starting from a model pre-trained on the Pascal-VOC dataset [39]. They also attempt table structure recognition along with table detection.
Vo et al. [30] published a method for page object detection that covers figures, formulas, and tables. Their technique uses an ensemble of Fast R-CNN [31] and Faster R-CNN [11]: the region proposals obtained from both networks are combined, and bounding box regression is then applied to boost performance. They used the ICDAR-17 POD dataset [5] to benchmark their approach.
The blend of traditional methods and deep learning networks is presented by Younas et al. [40] to solve the problem of formula and figure detection in document images. The authors propose that instead of giving raw input images to object detection algorithms, transformed image representations yield better results. Connected component analysis (CC), distance transform, and color transform on the raw input images are performed and are subsequently processed using the Faster R-CNN model.
Gilani et al. [33] have utilized a similar technique. They have used the image transformation method in which a Euclidean distance transform [41], linear distance transform [42], and max distance transform [43] are applied on blue, green, and red channels of the input image, respectively. This transformed image is further propagated to Faster R-CNN to identify and regress the tabular boundaries in document images.
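The channel-wise encoding described above can be sketched as follows. This is a minimal brute-force illustration rather than the authors' implementation: it assumes that "linear" corresponds to the city-block (L1) distance and "max" to the chessboard (L-infinity) distance, that distances are measured from each foreground pixel to the nearest background pixel, and that the input has already been binarized. The function names `distance_maps` and `to_bgr` are hypothetical.

```python
import math

def distance_maps(binary):
    """Distance from every foreground pixel to its nearest background pixel.

    binary: list of rows, 1 = foreground, 0 = background (assumed convention).
    Returns (euclidean, city_block, chessboard) maps of the same shape.
    Brute force, so only suitable for tiny illustrative images.
    """
    h, w = len(binary), len(binary[0])
    background = [(r, c) for r in range(h) for c in range(w) if binary[r][c] == 0]
    eucl = [[0.0] * w for _ in range(h)]
    city = [[0] * w for _ in range(h)]
    chess = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Background pixels (and images without any background) keep distance 0.
            if binary[r][c] == 0 or not background:
                continue
            eucl[r][c] = min(math.hypot(r - br, c - bc) for br, bc in background)
            city[r][c] = min(abs(r - br) + abs(c - bc) for br, bc in background)
            chess[r][c] = min(max(abs(r - br), abs(c - bc)) for br, bc in background)
    return eucl, city, chess

def to_bgr(binary):
    """Stack the three maps per pixel as (blue, green, red) = (Euclidean, linear, max)."""
    eucl, city, chess = distance_maps(binary)
    h, w = len(binary), len(binary[0])
    return [[(eucl[r][c], city[r][c], chess[r][c]) for c in range(w)] for r in range(h)]

# A 3x3 image whose only background pixel is the top-left corner.
image = [[0, 1, 1],
         [1, 1, 1],
         [1, 1, 1]]
encoded = to_bgr(image)
print(encoded[2][2])  # the three metrics disagree for the far corner
```

In a real pipeline the three maps would typically be rescaled to the 0 to 255 range before being passed to the detector; that normalization is omitted here.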
Another work [32] compares the performance of two state-of-the-art object detection networks, Faster R-CNN [11] and Mask R-CNN [10], on graphical page objects. The article presents exhaustive evaluations on the detection of tables, formulas, and figures in document images. The paper concludes that Mask R-CNN [10] is better suited to page object detection because of the additional components in its loss function.

Mask R-CNN
Mask R-CNN [10] extends Faster R-CNN [11] with an additional loss known as the segmentation loss. Figure 7 depicts the basic architecture of Mask R-CNN; comprehensive details about the network can be found in [10]. The graphical page objects present in document images have very low inter-class variance: an object originally labeled as a table can easily be misclassified as a figure or a formula. By leveraging the segmentation loss of Mask R-CNN, researchers in the document image analysis community have improved the performance of graphical page object detection systems. This section covers those methodologies.
Saha et al. [32] published a method for page object detection in document images that employs Mask R-CNN. Their end-to-end deep learning-based system, called Graphical Object Detection (GOD), detects tables, figures, and formulas directly from the raw input images. The authors propose that there is no need to add extra pre- or post-processing steps to solve page object detection. By leveraging transfer learning, the authors benchmarked their system on the well-known datasets ICDAR-17 POD [5], UNLV [44], and ICDAR-13 [38].
A recent end-to-end table detection network called CDeC-Net is introduced by Agarwal et al. [34]. The system CDeC-Net leverages the novel object detection network Cascade Mask R-CNN based on Cascade R-CNN [12]. The presented article has shown a noticeable improvement in the performance of table detection system across several datasets such as ICDAR-17 POD [5], ICDAR-13 [38], ICDAR-2019, Marmot [45], TableBank [46], PubLayNet [9], and UNLV [44]. After extensive evaluations, the authors have concluded that the network Cascade Mask R-CNN is superior to the previous state-of-the-art table detection systems.

Deformable Convolutions
Deformable convolutions differ from conventional convolutions by adding deformable modules. A deformable module augments the regular sampling grid with location offsets. The offsets are learned from the preceding feature maps through additional convolution layers. This process makes the receptive field dynamic and enables the convolutional filters to adapt to different scales. While Figure 8 depicts the basic intuition behind deformable convolutional networks, the architecture is explained thoroughly in [47]. Most of the methodologies mentioned so far employ conventional convolutions in their object detection frameworks to solve page object detection in document images. Recently, deformable convolutions [47] have been investigated as a replacement for conventional convolutions to detect tables, figures, and formulas. This section highlights those approaches.
Siddiqui et al. [27] proposed an approach to detect tables that leverages deformable convolutions in the object detection framework. The authors argue that deformable convolutions are better suited to the problem of table detection: because of their dynamic receptive field, tabular areas of various scales and aspect ratios can be localized conveniently. The authors employed Faster R-CNN [11] and replaced the conventional Feature Pyramid Network (FPN) with a deformable FPN module. After extensive evaluations, the authors showed that deformable Faster R-CNN outperforms the conventional Faster R-CNN for table detection in document images.
Younas et al. [28] exploited a similar approach by employing a deformable FPN module to detect formulas and figures in document images. Instead of providing raw input images to their deformable Faster R-CNN model, the authors proposed an image transformation method identical to [40]. With the combination of transformed image representation and deformable object detection architecture, the authors produced state-of-the-art results for figure and formula detection on the well-known ICDAR-17 POD dataset [5]. Along with the novel approach, the authors also corrected the ICDAR-17 POD dataset [5] and made it publicly available at https://bit.ly/2AUSlzI (accessed on 25 April 2021).
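To make the offset-based sampling in this section concrete, the following sketch evaluates a 3x3 deformable convolution at a single output location. It is a simplified illustration under stated assumptions: the offsets are passed in as fixed numbers, whereas in an actual deformable network they are predicted by an additional convolution layer, and samples falling outside the feature map are treated as zero. The function names `bilinear` and `deformable_conv_at` are hypothetical.

```python
import math

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at a fractional (y, x).

    Positions outside the map contribute 0 (an assumption of this sketch).
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def deformable_conv_at(feature, weights, offsets, cy, cx):
    """Response of a 3x3 deformable convolution at output location (cy, cx).

    weights: 3x3 kernel; offsets: 3x3 grid of (dy, dx) pairs, one per kernel tap.
    With all offsets equal to (0, 0) this is a conventional 3x3 convolution.
    """
    total = 0.0
    for i, ky in enumerate((-1, 0, 1)):
        for j, kx in enumerate((-1, 0, 1)):
            dy, dx = offsets[i][j]
            total += weights[i][j] * bilinear(feature, cy + ky + dy, cx + kx + dx)
    return total

# Toy 5x5 feature map and an all-ones kernel.
feature = [[r * 5 + c for c in range(5)] for r in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
no_shift = [[(0.0, 0.0)] * 3 for _ in range(3)]
half_down = [[(0.5, 0.0)] * 3 for _ in range(3)]

regular = deformable_conv_at(feature, ones, no_shift, 2, 2)
shifted = deformable_conv_at(feature, ones, half_down, 2, 2)
print(regular, shifted)
```

Because every kernel tap can move independently, the sampled region can stretch to follow a wide table or shrink around a small formula, which is the intuition behind the dynamic receptive field mentioned above.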

Dynamic Programming Based Approach
Yi et al. [36] introduced a deep learning-based graphical page object detection approach similar to object detection algorithms. In the presented approach, a convolutional neural network designed specifically for page object detection proposes candidate regions, which are refined through a dynamic programming approach instead of the well-known non-maximum suppression method [48]. Their system localizes tables, figures, formulas, and text lines in document images. The authors argue that page objects have a high variance in their aspect ratios, unlike objects in natural scene images. Therefore, non-maximum suppression is not well suited to detecting all the page objects in a document image. The presented work compares the performance of their system with the conventional object detection approaches Fast R-CNN [31] and Faster R-CNN [11], and concludes that the dynamic programming-based approach outperforms them.
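For reference, the non-maximum suppression baseline that this approach replaces can be sketched as below. This is the standard greedy formulation, not the dynamic programming refinement of [36], whose details are not given here. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names and the default threshold of 0.5 are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than `threshold`.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))        # the second box overlaps the first too strongly
print(nms(boxes, scores, 0.7))   # a looser threshold keeps all three
```

The single fixed overlap threshold is the part that is hard to tune when candidate regions span very different aspect ratios, which is the concern the authors raise for page objects.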

Fully Convolutional Neural Networks
Along with object detection algorithms, Fully Convolutional Neural Networks (FCNNs) [49] have been exploited to solve graphical page object detection in document images. The basic intuition behind FCNNs is to assign a label to each pixel in an image. Figure 9 depicts the architecture of FCNNs; for further explanation, we refer our readers to [49]. Kavasidis et al. [35] posed the problem of table and chart detection as a saliency detection problem. The authors propose that each class of page object can be treated as a separate saliency category. To segment those categories (tables and charts), the system employs FCNNs in which each pixel of a document image is classified as table, chart, or background. The obtained saliency map is further propagated to a fully connected Conditional Random Field (CRF) [49], which smooths the system's output.
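The per-pixel labeling idea can be illustrated with a short sketch. It assumes that the network has already produced one score map per class and simply assigns each pixel the class with the highest score; the CRF smoothing step is not reproduced. The class order in `CLASSES`, the function name `label_pixels`, and the toy scores are assumptions made for illustration.

```python
CLASSES = ("background", "table", "chart")  # assumed class order

def label_pixels(score_maps):
    """Assign each pixel the index of its highest-scoring class.

    score_maps[k][r][c] is the score of class k at pixel (r, c).
    """
    h, w = len(score_maps[0]), len(score_maps[0][0])
    return [
        [max(range(len(score_maps)), key=lambda k: score_maps[k][r][c]) for c in range(w)]
        for r in range(h)
    ]

# Toy 2x2 score maps for the three classes.
scores = [
    [[0.90, 0.10], [0.20, 0.30]],  # background
    [[0.05, 0.80], [0.10, 0.30]],  # table
    [[0.05, 0.10], [0.70, 0.40]],  # chart
]
labels = label_pixels(scores)
print([[CLASSES[k] for k in row] for row in labels])
```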

Datasets
Deep neural networks consist of a huge number of parameters. To achieve convergence, datasets with a massive number of images are required to train these networks optimally [27,29]. Recently, the document image analysis community has published several public datasets. Some of these datasets provide annotations for various graphical page objects. This section mainly covers the recently published datasets that contain information about the boundaries of tables, formulas, and figures. Figure 10 depicts a few samples from these datasets.
Figure 10. Sample document images taken from the datasets DocBank [13], ICDAR-13 [38], IIIT-AR-13k [51], and PubLayNet [9]. Part (a,b) represent the highlighted graphical page objects in a document image.
Moreover, we discuss a few datasets that only contain annotations for one of the three mentioned page objects, such as tables. Figure 11 depicts a couple of samples belonging to these datasets. Table 2 presents the summary of all the datasets covered in this section.
Figure 11. Sample document images taken from the datasets ICDAR-17 POD [5], ICDAR-19 [52], TableBank [46], and UNLV [44]. Part (a,b) represent the highlighted graphical page objects in a document image. It is important to mention that most of the datasets illustrated in this figure have annotations for the tabular boundaries only.
Table 2. Graphical page object datasets. It is important to mention that we have considered equation and formula as semantically equal in this table. Some of these datasets contain as many as 12 page objects [13]. For the sake of convenience, we have only included table, figure, and formula.

PubLayNet
In 2019, Zhong et al. [9] published a huge dataset for document layout analysis known as PubLayNet. This dataset is generated by automatically annotating the document layout of over 1 million PubMed Central™ PDF articles. The dataset contains various document layout categories, including text, title, list, table, and figure. With more than 360 thousand annotated document images, this dataset enables researchers to develop and evaluate advanced deep learning-based models for document page object detection.

DocBank
Another dataset for document layout analysis was released by Li et al. [13]. The dataset is known as DocBank, which is the extended version of the TableBank dataset [46]. DocBank is a novel large-scale dataset constructed by employing weak supervision from the LaTeX documents available on arXiv.com. The proposed dataset comprises 500 thousand document pages with 12 different kinds of semantic blocks such as tables, figures, equations, lists, paragraphs, etc. The authors also define the training/validation/test splits, in which 400 thousand samples are used for training, whereas 50 thousand samples each are allocated for validation and testing. This large-scale rich dataset extends the opportunities to investigate the blend of deep neural networks employed in computer vision with the methods mainly used in document analysis.

Marmot
Marmot is widely utilized by researchers working on table and formula understanding. This dataset has been published by the Institute of Computer Science and Technology (Peking University) and is described in the paper by Fang et al. [45]. The dataset consists of 2000 document images taken from conference papers in both English and Chinese from 1970 to 2011. There is roughly a 1:1 ratio of positive to negative images in the dataset. Due to the complex page layouts, this dataset is highly applicable for evaluating table detection systems. There were a few instances of incorrect annotations in the dataset, which were corrected by Schreiber et al. [29]. The size of the dataset reduces to 1967 document images after the correction.

TableBank
In early 2019, Minghao et al. [46] recognized the table detection community's need for enormous datasets and established TableBank. TableBank is a dataset consisting of 417 thousand labeled images containing tabular data. The dataset has been accumulated by gathering information from documents available in .docx format. It also contains LaTeX documents collected from the arXiv database. The publishers of this dataset suggest using it for both table structure recognition and table detection tasks. The authors claim that this large-scale dataset will enable researchers to exploit the capacity of deep neural networks.

IIIT-AR-13k
Mondal et al. [51] proposed the novel IIIT-AR-13k dataset. The dataset mainly consists of business-type documents. In total, there are 13 thousand pages containing graphical elements like tables, signatures, figures, and so on. The authors annotated the bounding boxes manually, and the dataset is one of the biggest in the domain of graphical page object detection.

DeepFigures
To the best of our knowledge, DeepFigures [53] is one of the most extensive free-to-use datasets for the task of graphical page object detection. It comprises more than 1.4 million documents along with the boundaries of tables and figures. The authors leverage scientific articles found online in databases like PubMed and arXiv to create the dataset. This large-scale dataset provides an opportunity to investigate the performance of table and figure detection systems in document images.

ICDAR-13
ICDAR-13 [38] is widely utilized for the problem of table detection and table structure extraction. The dataset consists of PDF files that are converted into images. The documents contain graphs, structured tables, text, and charts. However, the dataset only provides annotations for table detection and table structure recognition. It has 67 PDFs with 150 tables, of which 27 PDFs are from the EU and 40 PDFs are from the US Government. In total, this dataset has 238 images, of which 128 images contain tables. This dataset is often used for reporting and comparing results.

UNLV
UNLV [44] is a very well-known dataset in the field of document image analysis. This dataset is composed of various documents like business letters, magazines, reports, newspapers, etc. Even though this dataset has almost 10,000 images, only 427 images contain a tabular region. The research community often uses only the images that contain tabular regions to conduct experiments.

ICDAR-2019
In 2019, the Competition on Table Detection and Recognition (cTDaR) [52] was held at ICDAR. The competition introduced two new datasets: a historical dataset and a modern dataset. The historical dataset encompasses train schedules, simple tabular prints from old books, images from hand-written accounting ledgers, and so on. In contrast, the modern dataset encompasses samples from forms, financial documents, and scientific papers. This dataset has become a benchmark for assessing the performance of state-of-the-art systems for table analysis.

Evaluation
This section covers the well-known evaluation metrics that have been employed by the deep learning-based approaches to assess their performance and compares the results among various state-of-the-art approaches. Moreover, we will present the comprehensive evaluative comparison between the methodologies that are explained in Section 3.

Precision
The precision metric is defined as the ratio of correctly predicted positive samples to all samples predicted as positive. Figure 12 depicts the definition of precision for the problem of graphical page object detection in document images. Mathematically, it is described as:
$$\text{Precision} = \frac{TP}{TP + FP} \quad (1)$$
where TP denotes the True Positives and FP represents False Positives.
Figure 12. An instance of a precise and an imprecise table detection. Green color represents the ground-truth tabular area whereas red color denotes the predicted tabular boundary.

Recall
The recall metric evaluates the performance of the system by calculating the fraction of ground-truth annotations in the test set that are correctly predicted. It is calculated as follows:
$$\text{Recall} = \frac{\text{correct predictions}}{\text{total correct annotations in ground-truth}} = \frac{TP}{TP + FN} \quad (2)$$
where TP denotes the True Positives and FN represents False Negatives.

F-Measure
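The harmonic mean of precision and recall is known as the F-Measure or F1 score. It is given by:
$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)$$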
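As a small worked illustration, the following sketch computes precision as TP / (TP + FP), recall as TP / (TP + FN), and the F-Measure as their harmonic mean from raw counts. The function name and the example counts are illustrative assumptions, and returning 0.0 when a ratio is undefined is a convention chosen for this sketch rather than something prescribed by the surveyed papers.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from true positive, false positive,
    and false negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical detector: 8 correct detections, 2 spurious ones, 4 missed objects.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f, 3))
```

With these counts the detector is fairly precise (0.8) but misses a third of the objects, and the F-Measure of about 0.727 sits between the two values, closer to the lower one.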

Intersection Over Union
Intersection Over Union (IOU) is a well-known evaluation metric commonly used in evaluating the capabilities of object detection algorithms. Since object detection techniques have been widely exploited to solve graphical page object detection, we have decided to discuss this metric in our paper. IOU calculates how much the area of the predicted bounding box intersects with the area of the actual ground-truth. For the sake of convenience, an example of computing IOU is illustrated in Figure 13. Mathematically, it is explained as follows:

$$\text{IOU} = \frac{\text{Area of Overlap region}}{\text{Area of Union region}} \quad (4)$$
Figure 13. Visual illustration of IOU in object detection methods. The bounding box with blue color represents the ground-truth whereas the bounding box with red color denotes the predicted bounding box. Considering the IOU threshold set to 0.5, only the first two predictions from the left will be considered true positives whereas the rest of them will be treated as false positives.
The problem of graphical page object detection is to localize the boundaries of formulas, figures, and tables. For the sake of visual convenience, we have divided the performance evaluation of the explained methodologies into three separate tables. The quantitative analyses for the detection of tables, figures, and formulas are summarized in Tables 3-5, respectively. Various approaches have evaluated their methods at different IOU thresholds.
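How an IOU threshold turns predicted boxes into true positives, false positives, and false negatives can be sketched as follows. This is one common greedy matching scheme, shown under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, each ground-truth box may be matched at most once, predictions are visited in order of decreasing confidence, and a match requires an IOU greater than or equal to the threshold. Individual benchmarks may define the matching differently, and the function names are hypothetical.

```python
def iou(a, b):
    """Area of overlap divided by area of union for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(predictions, ground_truths, threshold=0.5):
    """Greedy one-to-one matching; returns (tp, fp, fn).

    predictions: list of (box, score) pairs; ground_truths: list of boxes.
    """
    unmatched = list(range(len(ground_truths)))
    tp = fp = 0
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        # Best still-unmatched ground-truth box for this prediction, if any.
        best = max(unmatched, key=lambda g: iou(box, ground_truths[g]), default=None)
        if best is not None and iou(box, ground_truths[best]) >= threshold:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)

ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [((0, 0, 10, 9), 0.9), ((5, 0, 15, 10), 0.8), ((100, 100, 110, 110), 0.7)]
print(match_detections(predictions, ground_truths, threshold=0.5))
print(match_detections(predictions, ground_truths, threshold=0.95))
```

The same set of predictions yields different counts at different thresholds, which is why results reported at distinct IOU thresholds are not directly comparable.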

Evaluation for Table Detection
It can be observed from Table 3 that instance segmentation-based architectures like Cascade Mask R-CNN have outperformed the rest of the approaches by a slight margin. This shows that the multi-scale classification module that has improved generic object detection [54] has also advanced table detection systems in document images.

Evaluation for Figure Detection
Table 4 compares the performance of the two recently proposed deep learning-based approaches for figure detection in document images. It is evident that the approach with deformable convolutions has outperformed the instance segmentation-based approach. This is because of the dynamic receptive field, which accommodates figures of various scales and aspect ratios in the document images. The results also suggest that, instead of providing raw images to the deep neural network, transforming the images through traditional document image analysis methods can yield better results.

Evaluations for Formula Detection
Table 5 presents the performance comparison between the two novel approaches. Analogous to figure detection, the approach that blends image transformations with deformable convolutions outperforms the other method for formula detection.
When evaluating page object detection systems, it is essential to mention that there is still room for improvement in designing deep neural networks that can localize and classify all the page objects present in a document image. So far, we have seen that particular methods or modules are utilized to detect particular types of page objects.

Discussion and Conclusions
The process of extracting precise information from graphical page objects is a crucial and challenging problem in document image analysis and has received noticeable attention. State-of-the-art page object detection systems have improved remarkably due to recent advances in deep learning. This survey paper has provided a comprehensive overview of approaches that perform end-to-end graphical element detection in document images. Furthermore, this paper presents a structural taxonomy of the approaches according to the utilized deep learning method in Section 3, and compares these methods by highlighting their advantages and disadvantages in Table 1. Moreover, we explain the recently employed datasets in Section 4 and summarize their essential statistics in Table 2. We also discuss the currently used evaluation criteria and analyze the performance of current deep learning-based graphical page object detection systems in Section 5. We conclude this survey paper with a discussion of the current difficulties and challenges in Section 6.1, and finally recommend some future directions in Section 6.2.

Difficulties and Challenges
After reviewing several methods in the field of graphical page object detection, we have noticed some key issues that deserve to be addressed. These are as follows:
• First and foremost, the current state-of-the-art performs better when the network is trained for a single type of graphical object, i.e., only for tables or only for formulas. The performance of a detector degrades when it is trained to detect multiple graphical page objects in document images [5,32].
• The second critical challenge is low inter-class (between different classes) and high intra-class (within the same class) variation. Due to low inter-class variance, tables without ruling lines can easily be misclassified as algorithms or mathematical formulas and vice versa [27,55]. Similarly, a figure can be falsely predicted as a table and vice versa on account of high intra-class variance.
• The datasets differ significantly from each other. At present, several datasets only focus on a single graphical page object [38,46,52]. Therefore, there is a growing need for large-scale datasets that provide annotations for multiple page objects like figures, formulas, and tables [5,13].
• The recent two-stage object detection networks are generally very large [56,57]. It is not easy to process the images at their original resolution with limited computational resources. Therefore, some important features are lost during downsampling, which particularly affects the detection of smaller graphical page objects such as embedded formulas [58].
• Most of the current state-of-the-art methods rely on some post-processing to obtain reliable results [28]. Therefore, more generic deep learning-based solutions are required that can detect distinctive graphical page objects in diverse environments.
Because of the challenges mentioned above, we can conclude that standardization is required with diversity to tune the methods towards generic graphical object detection in document images. Moreover, the development of methods tailored specifically for graphical page object detection in document images can significantly improve the performance. This work is one effort to consolidate the reported performance of deep neural network architectures on the most renowned datasets.

Future Work
There are many possibilities to explore in order to improve the performance of graphical page object detection in document images. In general, recently proposed novel neural network architectures for object detection [56,[59][60][61] can improve the performance of graphical page object detection systems. The second promising direction is the multimodal processing of the graphical objects. In the case of graphical page object detection, multimodal processing, in its simplest form, is the joint processing of image information and text information [62,63]. For example, when a figure is misclassified as a table or vice versa, the text information can help resolve the ambiguity. The table is the most complicated graphical page object among all the graphical page objects [48]. To improve the performance further, another promising path to explore is the localization of individual columns and rows of the detected tables. In addition, identifying the headers of a table can significantly help to understand the table's inner structure. Furthermore, the following directions can be explored in the future:
• Weak/Unsupervised learning: At present, all the reliable graphical page object detection systems depend on large-scale labeled datasets. The process of annotating the document images with graphical page objects is laborious and inefficient. Hence, there is a dire need to build weakly supervised or unsupervised graphical page object detection systems that produce strong results after training on limited samples.
• Light weight systems: Modern state-of-the-art graphical page object detection methods are computationally expensive. However, there is a growing need to build intelligent information extraction systems that can work effortlessly on mobile devices [64,65].
• Domain adaptation: There is still a significant gap in developing page object detection methods that can adapt to different domains. An example of such a scenario is building a system that works equally well on historical and modern document images.
• Neural architecture search: Deep learning enables us to eliminate custom feature engineering, which demands domain knowledge. However, the currently employed deep neural networks also require setting precise hyperparameters. Another exciting direction could be leveraging neural architecture search to automate design choices such as the number of layers and anchor settings, as accomplished in the field of computer vision [66][67][68][69].