Rethinking Learnable Proposals for Graphical Object Detection in Scanned Document Images

Abstract: In the age of deep learning, researchers have looked at domain adaptation under the pre-training and fine-tuning paradigm to leverage the gains in the natural image domain. The backbones and subsequent networks used in this paradigm are designed for object detection in the natural image domain. They do not consider some of the critical characteristics of document images. Document images are sparse in contextual information, and the graphical page objects are logically clustered. This paper investigates the effectiveness of deep and robust backbones in the document image domain. Further, it explores the idea of learnable object proposals through Sparse R-CNN. This paper shows that simple domain adaptation of top-performing object detectors to the document image domain does not lead to better results. Furthermore, it shows empirically that detectors based on dense object priors, like Faster R-CNN, Mask R-CNN, and Cascade Mask R-CNN, are perhaps not best suited for graphical page object detection. Detectors that reduce the number of object candidates while making them learnable are a step towards a better approach. We formulate and evaluate the Sparse R-CNN (SR-CNN) model on the IIIT-AR-13K, PubLayNet, and DocBank datasets and hope to inspire a rethinking of object proposals in the domain of graphical page object detection.


Introduction
We live amid the digital age, surrounded by the digital universe, which is expanding with the daily data we produce. In this digital age, we are creating, consuming, and replicating data like never before [1]. The estimates tell us that from 2005 to 2020, our digital universe has grown 300 times, from 130 exabytes to 40,000 exabytes of data. It will continue to double every year. It is also estimated that as much as 33% of the digital universe contains information that could be potentially valuable if analyzed. Further, almost 40% of the data produced will be touched by the cloud in some way or another. Manually processing such large amounts of data is laborious, time-consuming, and borderline infeasible. This presents an excellent opportunity for data mining and analysis. To this end, the field of document digitization has received considerable attention and made rapid strides in recent years [2]. Document digitization is the automatic extraction of meaningful information from scanned document images.
Information is stored in two major formats in scanned document images: text and graphical page objects. Graphical page objects can be tables, figures, natural images, equations, formulas, logos, etc. [2][3][4]. They contain valuable information, but this information can only be extracted when these objects are correctly identified in the respective images [5]. Figure 1 shows the importance of detecting graphical page objects before extracting information.
Graphical page objects have high intra-class variations in shape, type, and scale [2,6]. Tables come in different shapes, sizes, and layouts (bordered or borderless) [3,6,7,8]. There are also significant inter-class similarities [6]. Further, natural images can have superimpositions of vectors or diagrams. Figures can be arbitrary or abstractions of natural images. Equations and formulas can span single or multiple lines. A significant inter-mixing of classes can also occur, for instance, a logo inside a table or figure. Figure 2 illustrates the challenges in graphical page object detection.
Deep learning has been one of the most popular and explored solutions to object detection in the natural image domain. In recent years, it has seen rapid improvements. Graphical page object detection is a challenging downstream domain adaptation task of generic object detection in the natural image domain. This fact is not lost on many researchers, and they have come up with many adapted models for graphical object detection [2,11].
Document images are sparse in contextual information as compared to natural images. The graphical page objects of different classes, such as tables, figures, and text, are logically clustered together in document images. The same is not always true with natural images. The different graphical page objects show a wider variety of scale than objects in natural images. Further, most top-performing deep learning models on natural images have strong and deep backbones with texture-driven region proposal networks (RPNs) [4]. By design, these RPNs propose regions in natural images rich in texture, contrast, and color. Selective search in Fast R-CNN [12], and the RPN in Faster R-CNN [4,13] are examples of such.
The assumption of logical clustering may not always hold for natural images. Thus, it makes sense to have dense object candidates over the feature maps to avoid missing objects in the image. When coupled with strong backbones, dense object candidates in document images can confuse the detector head. Hence, naive adaptation of models designed for dense natural images to the document image domain may not yield good results.
Most state-of-the-art graphical page object detection models suffer from the aforementioned systemic challenges and issues. They fail to account for the differences in the domains in an intrinsic manner. Instead, they rely on extrinsic methods such as heavy pre/post-processing of images, and are therefore not genuinely end-to-end. In an attempt to address these systemic issues, this paper investigates the feasibility of sparse and learnable object candidates in scanned images of documents and attempts to justify the applicability of the same through extensive experimentation. This paper also investigates the relationship between meaningful gains and the feature extraction capabilities of deep backbones. The scope of the paper has been limited to investigating the effectiveness of deep backbones and learnable sparse proposals for the downstream task of graphical page object detection. Our experiments show that sparse proposal-based methods such as Sparse R-CNN come close to state-of-the-art methods while being computationally cheaper.
The main contributions of the paper can be summarized as follows:
• Empirically shows that detectors based on dense object candidate priors are not well suited for graphical page object detection. Detectors that reduce the number of object candidate priors while also making them learnable are a better approach;
• Investigates the effectiveness of deep and robust backbones using the CBNetV2 [14]
The remainder of this paper is structured as follows. Section 2 categorizes the prior work in the field into traditional approaches and deep learning approaches. Section 3 describes the deep backbones investigated, such as CBNetV2 (Section 3.1). Further, this section also describes the detection heads investigated, such as Faster R-CNN (Section 3.2.1), Mask R-CNN (Section 3.2.2), Cascade Mask R-CNN (Section 3.2.3), and Sparse R-CNN (Section 3.2.4). Section 4 describes the datasets employed in our experiments, the evaluation protocol, the implementation details of the models, and the qualitative and quantitative analysis of our experiments, and it concludes with the cross-dataset evaluation for the models. Section 5 concludes the paper and provides a brief look into future work.

Related Work
The problem of graphical page object detection is quite challenging. The approaches to solving the problem can be broadly divided into two categories: traditional and deep learning approaches. The traditional approaches can be further divided into rule-based and learning-based methods [2,11]. The rule-based methods rely on manually defined rules and heuristics for the detection of graphical page objects in document images, whereas learning-based methods rely on statistical learning from the data and standard machine learning algorithms. With the advent of deep learning, many deep learning-based approaches to graphical page object detection have been developed. The field has seen rapid progress in recent years.

Traditional Approaches
Some of the earliest approaches exploited the metadata from the output of optical character recognition (OCR) systems. Green et al. [16] used the metadata to construct custom grammars and pre-defined heuristics to detect graphical page objects. Tupaj et al. also exploited the metadata from OCR to preserve the geometry of white spaces in document images. The geometrical information was then used alongside heuristics to detect tabular structures [17]. Kieninger et al. used the metadata to extract text-level geometry, which was then used to locate and extract tables from document images [18]. Others, such as Hu et al. [19], used image processing techniques to calculate the correlation between white spaces in a document and performed a connected component analysis to predict the presence of tables and other graphical page objects in documents.
Some early work did not use document images. Costa e Silva et al. [20] employed statistical models such as Hidden Markov Models (HMMs) for table detection in documents. The method works by first parsing text from document files and then computing feature vectors by analyzing the white gaps between different objects in the document. Similarly, Shigarov et al. [21] proposed a method that does not utilize document images. It first extracts the metadata from documents. Next, it treats each word as an individual text block with bounding boxes. These bounding boxes are used to extract the boundaries for tables. Chandran and Kasturi [22] proposed a method for table structure recognition. First, both horizontal and vertical lines and white streams are extracted from the documents. Next, a structure interpretation step builds a structure for the table from the extracted features. The traditional methods can detect tables in documents that are simple and do not have complex patterns or structure. They rely on handcrafted features and rules to detect graphical page objects. Hence, these methods are highly susceptible to noise in the data.

Deep Learning-Based Approaches
In deep learning, graphical object detection is seen as a downstream task of generic object detection in natural images. One of the first works to adapt deep learning models from the natural image domain to the downstream task of graphical object detection was by Gilani et al. [23]. They applied Faster R-CNN to detect tables in document images. The document images are first transformed per pixel using a distance transformation mechanism. These transformed images are then processed by the Faster R-CNN model to aid in table structure recognition. Schreiber et al. [24] also applied Faster R-CNN [13] for detecting tables in scanned document images. They also proposed a novel deep learning approach for table structure recognition. In their work, they employed pre-trained backbones and showed that object detectors from the natural image domain could be successfully adapted for the document image domain on the ICDAR-13 dataset [25]. Following this in 2018, Vo et al. [26] proposed to combine Fast R-CNN [12] and Faster R-CNN using an ensemble paradigm to capitalize on the benefits of both object detectors. They achieved this by taking the region proposals from both Fast R-CNN and Faster R-CNN and applying the bounding box regression on their combination. They applied their ensemble model to the graphical page object detection problem and achieved promising results. By now, domain adaptation from the natural image domain to the document image domain had become quite popular. Siddiqui et al. [27] proposed DeCNT, which employed deformable convolutions to detect tables in document images. The authors empirically showed that the dynamic receptive field of the network due to the deformable convolutions can adapt better to the dynamic layout of tables in documents. Sun et al. [28] also applied Faster R-CNN for table detection, where the tabular area is retrieved by refining the coarse table detection and corner location produced by Faster R-CNN. This refining is achieved by grouping the corners belonging to the same table by coordinate matching and filtering out unreliable corners. In 2019, Saha et al. [29] proposed the GOD framework in which they investigated the performance of Faster R-CNN and Mask R-CNN [30]. They compared the two detectors for graphical page object detection in document images. After exhaustively evaluating the two detectors, they concluded that Mask R-CNN performs better for graphical page object detection than Faster R-CNN.
Detection is the basis for more complex tasks such as table structure recognition, in which tables are first detected and then the structure is also extracted. To this end, Shigarov et al. [31] proposed TabbyPDF, which is a web-based system for detecting and extracting structures of tables located in PDF documents. The system uses a heuristic-based approach for table detection and structure recognition. Bordered tables are detected using rule lines that compose a rectangular frame, whereas borderless tables are detected using a bottom-up segmentation approach. The authors evaluate their approach on the ICDAR-13 dataset. Chi et al. [32] proposed GraphTSR, a graph neural network for recognizing the table structure in PDF files. First, through pre-processing, the cell contents and their corresponding bounding boxes are extracted from the PDF. Then an undirected graph is produced using the cells, the edges of which are filtered using adjacent relation prediction through the attention mechanism. Finally, the predicted table structure is extracted from the labeled graph. TabStruct-Net, proposed by Raja et al. [33], approaches table structure detection as a relationship problem between the row and column of the detected cell. First, a ResNet backbone is used to detect cells from the table. Next, an alignment loss function is introduced to ensure that detected cells belong to the same row or column, followed by the post-processing step of producing an XML output of the table structure. One limitation of this method is its inability to handle tables with a large number of empty cells. Zang et al. [34] proposed SEM (split, embed, and merge), which exploits the multimodality of tables, as they contain both visual and text features. First, a fine grid structure of the potential rows and columns of the table is obtained. Next, using a visual module and text module, the two modalities are fused; the fused features are used to predict the table structure. Prasad et al. [35] proposed CascadeTabNet, which tries to solve table detection and table structure recognition together. Unlike most other detection networks that treat the problems of table detection and structure recognition separately, CascadeTabNet uses a single-shot method for both. They utilized the Cascade Mask R-CNN [36] detection head. More recently, CDeC-Net [37] utilized the dual-backbone paradigm introduced by CBNetv1 [38]. It has high-resolution dual ResNeXt101 backbones with deformable convolutions. The backbone is paired with a Cascade Mask R-CNN detector head. It allows the network to detect objects at different resolutions and achieve high accuracy, even at higher IoU thresholds.
As our work focuses on graphical page object detection, Table 1 compares the different approaches we investigate with prior work in the domain. The table lists the key characteristics of each approach, along with the limitations and advantages drawn from the experiments we carry out for each approach and its corresponding methods.

Related Datasets
Table detection and structure recognition is a well-studied area, and many standard benchmark datasets are available: UNLV [40], Marmot [41], ICDAR-13 [25], ICDAR-POD-17 [5], ICDAR-19 [42], and TableBank [43]. UNLV is one of the earliest datasets in the document object detection domain. The dataset consists of almost 10,000 document images. However, only 427 of them contain tables, and hence we only list the images containing tables in Table 2. Marmot, on the other hand, is one of the largest early datasets for table detection in document images. It was introduced by Peking University and consists of 2000 images for which a ratio of almost 1:1 is maintained between the positive and negative samples. The original dataset suffered from incorrect ground truth annotations; hence, we list the corrected version of the dataset from [24], which has 1967 images, in Table 2. ICDAR-2013 [25] is another popular dataset for both table detection and structure recognition. The dataset was constructed by converting PDF files to images. The dataset has 229 training images and 233 validation images. ICDAR-2017-POD (Page Object Detection) [5] is another dataset that, along with tables, also has information for the boundaries of formulas and figures. The dataset consists of 2417 images in total, where 1600 are for training and 817 are for validation. ICDAR-19 [42] is the latest dataset released by ICDAR, in 2019. The dataset consists of two types of document images: modern and historical. The modern set of documents is collected from scientific and commercial documents, whereas the historical documents are a collection of images of handwritten documents. TableBank, introduced by Li et al. [43], is one of the most well-known datasets in the table detection literature. The dataset has 417,000 training document images pooled from Word and LaTeX documents. It has 2000 images for both validation and test.
More recently, datasets for graphical page object detection have also emerged that, alongside tables, also have other page objects, such as figures, text, signatures, dates, equations, etc. Such datasets include IIIT-AR-13K [9], PubLayNet [44], and DocBank [45]. Table 2 shows the statistics of the various datasets. As our goal is to investigate graphical page object detection in document images, we restrict ourselves to IIIT-AR-13K, PubLayNet, and DocBank datasets, as they facilitate the general page object detection task (including tables).

Method
In recent years, many novel backbone networks have been developed that have consistently improved the performance of object detection [14]. First, the effectiveness of deep and robust backbones and their ensemble versions in the document image domain is investigated. To this end, Section 3.1 discusses the CBNetv2 compositing architecture this work investigates. Different detector heads are employed upon these deep backbones, which use novel methods for detection and segmentation. Section 3.2 discusses the different detection heads this study investigates.

CBNetv2: Composite Backbone Network v2
A key driver of advancements in object detection is the improvement in the backbones of the detectors that extract the features [14]. These improvements are often structural in nature. They also require high computing resources and expertise to achieve and validate. Another approach would be to use pre-existing, well-designed, validated deep backbones and design efficient ensemble techniques. CBNetv2 [14] is a step in this direction. The key idea of CBNetv2 is that it finds novel composition styles to ensemble identical existing pretrained backbones [14], which allows for greater flexibility and generalization capabilities as different detector designs can be explored to suit the task.
As seen in Figure 3, in CBNetv2 there are K identical backbones of which K − 1 are assisting backbones. Each backbone has its own feature pyramid network (FPN) consisting of L stages, and the output of each stage is denoted as $x_l$. Let a composition connection be denoted as $h_l(x)$, which takes $x = \{x_i \mid i \in [1, L]\}$ as inputs that come from each stage of the FPN of the assisting backbone and outputs a feature map $y_l$. Further, let $g$ represent a 1 × 1 convolution layer and a batch-normalization layer. The proposed composition connections are discussed in Section 3.1.1.

Composite Connections
There are many ways to achieve feature fusion across backbones. The simplest is fusing features of the same level across the backbones. Equation (1) formulates this as the Same Level Composition. In a deep backbone, the higher-level features amass greater semantic meaning. Adjacent Higher-Level Composition, as seen in Equation (2), fuses the higher-level feature of the adjacent backbone with the lower-level feature of the current backbone. The converse of Adjacent Higher-Level Composition is Adjacent Lower-Level Composition, as formulated in Equation (3). Dense features can be built by combining the features from all the higher levels of the assisting backbones with the lower ones of the lead backbone. Termed Dense Higher-Level Composition, it is formulated in Equation (4). Equation (5) describes the Fully Connected Composition, which tries to build comprehensive features by connecting features from all levels in the assisting backbones to the lead backbone. The authors in [14] show that Dense Higher-Level Composition and Fully Connected Composition achieve the best results when tested on the COCO dataset [46]. Taking computational complexity into account, the authors in [14] use Dense Higher-Level Composition. Our implementation of the model also utilizes the same.
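The five composition styles described above can be sketched in the notation of Section 3.1. This is a reconstruction from the prose and from our reading of [14], not a verbatim copy of Equations (1) to (5): the symbols U (upsampling), D (downsampling), and I (resizing to the resolution of stage l) are assumptions introduced here, as is the per-stage index on g, and the index ranges should be checked against [14].

```latex
\begin{align*}
\text{Same Level:}            \quad & h_l(x) = g(x_l),                          & l \ge 2 \\
\text{Adjacent Higher-Level:} \quad & h_l(x) = U\big(g(x_{l+1})\big),           & l \ge 1 \\
\text{Adjacent Lower-Level:}  \quad & h_l(x) = D\big(g(x_{l-1})\big),           & l \ge 2 \\
\text{Dense Higher-Level:}    \quad & h_l(x) = \sum_{i=l}^{L} U\big(g_i(x_i)\big), & l \ge 1 \\
\text{Fully Connected:}       \quad & h_l(x) = \sum_{i=2}^{L} I\big(g_i(x_i)\big), & l \ge 1
\end{align*}
```

In each case the output $y_l = h_l(x)$ is added to the input of the corresponding stage of the next backbone, which is why the dense and fully connected variants cost more: every stage aggregates several resized feature maps instead of one.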

Assistant Supervision
The authors in [14] recognize the positive correlation between depth and increased performance. However, increased depth leads to a regularization problem, requiring novel methods to improve convergence [47]. To tackle regularization and also increase depth, the authors propose assistant supervision. All the backbones and their detection heads are used during training, and the total loss minimized is given in Equation (6). In contrast, only the lead backbone is used during the inference, as given in Equation (7).
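As a sketch of the training objective described above, and assuming one scalar weight per assisting backbone (the symbols $\lambda_i$, $\mathcal{L}_{\text{Lead}}$, and $\mathcal{L}_{\text{Assist}}^{(i)}$ are names introduced here for illustration), the total loss can be written as the lead detection loss plus a weighted sum of the assistant losses:

```latex
\begin{equation*}
\mathcal{L} \;=\; \mathcal{L}_{\text{Lead}} \;+\; \sum_{i=1}^{K-1} \lambda_i \, \mathcal{L}_{\text{Assist}}^{(i)}
\end{equation*}
```

At inference time the assistant heads are dropped, so the sum vanishes and only the prediction of the lead backbone's detection head is used. The extra supervision therefore adds training cost but no inference cost.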

Detection Heads
Object detection aims to determine the locations of objects in a given image and their classes. This task can be broken down into two main steps: Target Region Selection and Classification and Regression. The methods applied to address these main steps lead to different detection head designs. Section 3.2 and its subsections discuss the different detection heads, such as Faster R-CNN (Section 3.

Faster R-CNN
Faster R-CNN [13] is a two-stage detector. It has two networks: a region proposal network (RPN) and a detection network. The RPN proposes regions of interest (RoIs), hence performing the target region selection. The detection network uses the RoIs proposed by the RPN for object bounding box regression and classification. Faster R-CNN builds upon Fast R-CNN [12]. In Fast R-CNN, selective search is decoupled from the detection network, so the false negatives of selective search directly limit the network's detection accuracy and the two stages cannot be optimized jointly. Therefore, the core idea is that Faster R-CNN does not utilize selective search for RoI proposals. As seen in Figure 4, it utilizes a small RPN instead. It hence drastically reduces the time cost for generating region proposals. The RPN outputs a set of rectangular object proposals, each with its objectness score. The regressor and the classifier then take these object proposals and objectness scores from the RPN. Finally, the classifier outputs the probability of a predicted RoI being an object (foreground) or background, while the regressor outputs the predicted bounding boxes.

Mask R-CNN
Semantic segmentation is the process of classifying each pixel of an image as belonging to a particular label. This labelling does not differentiate among instances. Hence, it can be considered a classification problem on a per-pixel basis. Mask R-CNN [30] is a detection head for semantic and instance segmentation. Mask R-CNN builds upon the groundwork laid by Faster R-CNN.
As seen in Figure 5, Mask R-CNN extends Faster R-CNN with the addition of two more convolution layers after the RoI align layer. The additional convolutional layers produce the segmentation masks. The RoI align layer is a novel addition in Mask R-CNN. The layer does not quantize the cell boundaries of the target feature maps like the RoI pooling layer in Faster R-CNN. Instead, it utilizes bilinear interpolation to compute feature values at the exact, fractional cell locations.
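The difference between quantized lookup and interpolation can be illustrated with a minimal sketch. The function names and the tiny 2 × 2 feature map below are hypothetical and chosen only for illustration; a real RoI align layer samples several such points per output cell and averages them.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional location (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the sampling point to the valid range of the map.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_lookup(feature, y, x):
    """RoI-pooling-style lookup: snap the coordinate down to an integer cell."""
    return feature[int(math.floor(y))][int(math.floor(x))]


feature = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(feature, 0.5, 0.5))   # 1.5, the average of the four cells
print(quantized_lookup(feature, 0.5, 0.5))  # 0.0, the fractional offset is lost
```

The quantized lookup discards the sub-cell offset, which matters for thin structures such as table borders, where a shift of a single cell can move the sampled value off the object entirely.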

Cascade Mask R-CNN
Most deep learning detectors, such as Fast R-CNN, Faster R-CNN, and Mask R-CNN as previously discussed, tend to show a degradation in performance with increasing intersection over union (IoU) thresholds. This occurs for two main reasons: overfitting during training, because positive samples vanish exponentially as the IoU threshold rises, and a mismatch at inference time between the IoU thresholds for which the detector was trained and the quality of the proposals it receives as input.
Cascade R-CNN [36] attempts to address these problems. Figure 6 shows that the core idea is refinement through cascading. A series of heads take the RoI features from the region proposal network (RPN). Each head has its classification and regression networks. The first head (Head 1) takes the RoI features and performs the first round of classification and regression. The output of the first head is treated as an input for the subsequent heads, producing a cascade effect.
This cascading architecture is essentially a resampling procedure. The architecture provides good positive samples to the next head, as later heads are more prone to false positives. This enables each subsequent head to operate at a higher IoU threshold. Further, this also tackles the mismatch between training and inference as the architecture and IoU thresholds are the same during training and inference. Finally, the last head's classification and bounding box predictions are used for optimal predictions. The addition of a segmentation mask branch extends it to Cascade Mask R-CNN.

Sparse R-CNN
Widely used deep learning object detectors such as Fast R-CNN, Faster R-CNN, Mask R-CNN, and Cascade Mask R-CNN all rely on dense object priors. These detectors have many anchor points containing anchor boxes that densely cover different spatial positions, aspect ratios, and scales. The dense coverage helps to predict the classes and bounding box coordinates. In the case of single-shot detectors, let W be the width and H be the height of a feature map. If each anchor point is responsible for predicting k bounding boxes, then we have a large number, W × H × k, of bounding boxes for each feature map.
Further, in the case of two-stage detectors such as Faster R-CNN, Mask R-CNN, and Cascade Mask R-CNN, dense object priors are generated first and then, using the region proposal network (RPN), a sparse set of foreground proposal boxes are obtained from these dense object priors. Finally, these are fed into the classification and regression branches to produce bounding box and category predictions.
The authors of Sparse R-CNN [15] propose a different paradigm that moves towards thoroughly sparse object priors. As seen in Figure 7, the authors avoid using an RPN. They instead use a small set of proposal boxes, usually 100 or 300. These proposal boxes are learnable. The authors introduce the dynamic head to obtain them.
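To give a sense of the gap between dense and sparse priors, the W × H × k count can be compared with the proposal budget quoted above. The feature map size and the value of k below are hypothetical values chosen only for illustration; they are not taken from our experiments.

```python
def dense_prior_count(width, height, boxes_per_location):
    """Number of anchor boxes on one W x H feature map with k boxes per location."""
    return width * height * boxes_per_location


# Hypothetical feature map size and k, for illustration only.
dense = dense_prior_count(width=100, height=150, boxes_per_location=9)
sparse = 300  # upper end of the learnable proposal counts quoted above

print(dense)            # 135000
print(dense // sparse)  # 450
```

Under these assumed numbers, a single feature map already carries several hundred times more candidates than the whole learnable proposal set, and a feature pyramid multiplies the dense count further.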
First, the initial set of randomly initialized proposal boxes in the dynamic head is used with the RoI align operation in extracting features from the network backbone. Next, each extracted RoI feature is convolved with a learnable proposal feature. The result is fed into its detection and classification head to predict the class and bounding box. Each head is essentially conditioned on these learnable proposal features. The proposal features essentially act as weights for the convolutions and give rise to the learnable proposal boxes.
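The interaction between an RoI feature and its proposal feature can be sketched as follows. This is a deliberately simplified, single-layer illustration in plain Python: the names `dynamic_interaction` and `w_gen` are hypothetical, and the normalization layers and the second generated layer used by the actual dynamic head in [15] are omitted.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def dynamic_interaction(roi_feature, proposal_feature, w_gen):
    """Filter one RoI feature with a kernel generated from its proposal feature.

    roi_feature:      (S*S) x C grid of RoI-aligned features for one box.
    proposal_feature: length-C learnable vector paired with that box.
    w_gen:            C x (C*C) weights of a hypothetical linear layer that
                      turns the proposal feature into a C x C kernel.
    """
    c = len(proposal_feature)
    flat = matmul([proposal_feature], w_gen)[0]          # length C*C
    kernel = [flat[i * c:(i + 1) * c] for i in range(c)]  # C x C, acts as a 1x1 conv
    return relu(matmul(roi_feature, kernel))              # (S*S) x C


# Toy example with S*S = 2 locations and C = 2 channels.
roi = [[1.0, -2.0], [3.0, 4.0]]
proposal = [1.0, 0.0]
w_gen = [[1.0, 0.0, 0.0, 1.0],   # this row makes the generated kernel the identity
         [0.0, 0.0, 0.0, 0.0]]
print(dynamic_interaction(roi, proposal, w_gen))  # [[1.0, 0.0], [3.0, 4.0]]
```

Because the kernel is generated from the proposal feature, each of the few hundred proposals filters its own RoI feature differently, which is what conditions each head on its learnable proposal feature.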

IIIT-AR-13K
IIIT-AR-13K [9] is a new public and open source dataset for graphical page object detection. The dataset has a total of 13k images with manually annotated bounding boxes of graphical objects in five popular and different categories: table, figure, signature, natural images, and logos [9]. The dataset consists of annual reports of 29 different corporations and companies. These reports span a large time interval of over ten years. These reports are also in many languages such as English, German, French, Russian, and Japanese.
The authors argue that such diversity in document images helps train highly effective and practical object detectors in business documents and academic articles [9]. They demonstrate that it is possible to achieve good performance by training on a smaller dataset that is more diverse than one that has a large number of examples [9]. As seen in Table 3, one key drawback of the dataset is that it is heavily skewed towards tables. The dataset contains only a few examples of the other classes, such as logos, signatures, and natural images.
Table 3. Statistics of training, validation, and testing sets in the IIIT-AR-13K [9] dataset (columns: Table, Figure, Signature, Natural Images, Logo).

PubLayNet
The PubLayNet [44] dataset is the largest public and open source dataset for document layout analysis. The automatic matching of the XML representations of PDF articles with their content helps create such a large dataset. These articles are from the medical domain. The dataset has over 360,000 scanned document images [44]. The dataset annotates the most popular and typical graphical objects, such as text, titles, lists, tables, and figures. The dataset goes a long way in the training of real-world object detectors. The PubLayNet dataset is frequently used as a corpus to train detectors to detect document layouts, especially in scientific articles. Table 4 shows the statistics of the dataset's training, validation, and test split.

DocBank
DocBank [45] is a unique dataset that utilizes weak supervision in its construction. The dataset is large, public, and open source, consisting of 500,000 document pages and having 13 classes. Table 5 shows the statistics of the most relevant classes to the graphical page object detection task in the training, validation, and test splits of the DocBank dataset. DocBank has both textual and layout information to facilitate document layout analysis and textual analysis using natural language processing (NLP) [45]. Most other datasets contain scanned images of documents and do not have fine-grained annotations. DocBank has token-level annotations and ample semantic annotations for figures, tables, and equations to facilitate common computer vision and NLP tasks.
Table 5. Statistics of the most relevant classes to the task of graphical page object detection in the training, validation, and testing sets in the DocBank [45] dataset (columns: Set, Abstract, Equation, Figure, List, Paragraph, Section, Table).

Evaluation Protocol
Analogous to standard work in the field of graphical page object detection, the performance of the Sparse R-CNN model is assessed on the metrics of recall, precision, F1-score, and mean average precision (mAP) [46]. The IoU thresholds of the achieved results are also reported to facilitate comparison with other approaches.

Intersection over Union
Intersection over Union (IoU) [48] is a standard measure of the overlap between the predicted and ground truth regions, computed as the area of their intersection divided by the area of their union. Equation (8) expresses Intersection over Union (IoU) mathematically.
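For axis-aligned bounding boxes, the computation can be sketched as below. The box format (x1, y1, x2, y2) with x1 < x2 and y1 < y2 is an assumption made for this example, and the function name is illustrative.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 for disjoint boxes
```

A prediction counts as a true positive only when its IoU with a ground truth box of the same class reaches the chosen threshold, which is why results are always reported together with that threshold.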

Recall
Let TP represent the true positives, and FN represent the false negatives. Recall [48] is defined as the ratio of true positives over the sum of true positives and false negatives. Equation (9) represents Recall mathematically.

Precision
Let TP represent the true positives, and FP represent the false positives. Precision [48] is defined as the ratio of true positives over the sum of true positives and false positives. Equation (10) represents Precision mathematically.

F1 Score
The F1-Score [48] is essentially the harmonic mean of the Recall and Precision. Equation (11) represents F1-Score mathematically.
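The three metrics follow directly from the counts of true positives, false positives, and false negatives at a fixed IoU threshold. The sketch below uses an illustrative function name and returns zero when a denominator is empty, which is a convention assumed here rather than one stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical counts: 6 correct detections, 2 spurious ones, 6 missed objects.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=6)
print(p, r)  # 0.75 0.5
```

With these counts the F1-score is approximately 0.6, which sits closer to the lower of the two values, as expected of a harmonic mean.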

Implementation Details
The Sparse R-CNN, baseline, and CBNetv2 models are all implemented using the MMDetection framework [49]. All the models utilize COCO pre-trained weights for the different detector backbones. All the models follow the pre-training fine-tuning paradigm, with the first stage of the detector backbone frozen. All input document images are resized to a maximum size of 1200 × 800 while preserving their aspect ratio. The SR-CNN model utilizes multi-scale training, scaling the width of images between 480 and 800 pixels. The baseline models are trained for 20 epochs, and the Sparse R-CNN-based models are trained for 36 epochs. We employ early stopping based on the validation accuracy for model selection for all models. We use the Adam optimizer for all Sparse R-CNN models and SGD for all baseline and CBNetv2-based models, with a linear learning rate warmup over the first 500 iterations, a weight decay of 0.0001, and learning rate steps at epochs 8 and 11.
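The aspect-ratio-preserving resize described above can be sketched as follows. This is an illustrative assumption about the rescaling rule (the long edge is capped at 1200 and the short edge at 800, and the smaller of the two scale factors is applied), not code taken from the experiments. The function name `keep_ratio_size` is hypothetical.

```python
def keep_ratio_size(width, height, max_long=1200, max_short=800):
    """New (width, height) so the image fits within max_long x max_short.

    Assumed rule: take the smaller of the two scale factors so that
    neither the long edge nor the short edge exceeds its limit.
    """
    long_edge = max(width, height)
    short_edge = min(width, height)
    scale = min(max_long / long_edge, max_short / short_edge)
    # Round to the nearest integer pixel size.
    return int(width * scale + 0.5), int(height * scale + 0.5)


# Hypothetical portrait page scanned at 2480 x 3508 pixels.
new_w, new_h = keep_ratio_size(2480, 3508)
```

Under this rule, a portrait page is limited by its short edge and a wide landscape page by its long edge, while the aspect ratio is unchanged in both cases up to rounding.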

Result and Discussion
This paper investigates the effectiveness of deep and robust backbone architectures with different detection heads in the document image domain. Further, the idea of learnable proposals, especially in the sparse paradigm, is also investigated. All this is set under the pre-training fine-tuning paradigm. Towards this end, in Section 4.4 and its subsections, we discuss the experimental results that support our hypothesis. This validation is achieved by experiments on the individual datasets noted in Section 4.1 and by cross-dataset evaluation on the same datasets. The subsections of Section 4.4 are structured as follows: In Section 4.4.1, the IIIT-AR-13K dataset is utilized to investigate the effectiveness of deep backbones and the idea of learnable proposals in the sparse paradigm. In Section 4.4.2, the findings from Section 4.4.1 are applied to the standard PubLayNet dataset, and the results are discussed. Section 4.4.3 investigates the cross-dataset generalizability of the models.

IIIT-AR-13K
Section 3.1 discussed the compositing architecture for deep learning backbones, and Section 3.2 discussed the different types of detection heads this study aims to investigate. This section discusses the experimental results on the IIIT-AR-13K dataset.

Establishing Baselines
The first thing needed for any comparison is the establishment of strong baselines. To this end, under the pre-training fine-tuning paradigm, baseline Faster R-CNN, Mask R-CNN, and Cascade Mask R-CNN models are trained on the IIIT-AR-13K dataset. The training follows the implementation details noted in Section 4.3. Table 6 shows the mean average precision (mAP) averaged over Intersection over Union (IoU) thresholds from 0.5 to 0.95, in accordance with the COCO evaluation protocol [46]. The table also shows the average precision (AP) at IoU thresholds of 0.75 and 0.5. It is evident that Cascade Mask R-CNN establishes the strongest baseline for the IIIT-AR-13K dataset, reaching an mAP of 0.76.
This strong performance is due to the cascading nature of the detection head. Cascade Mask R-CNN allows each subsequent head to be enhanced by the localizing distribution of the previous head, hence giving better performance at the higher IoU threshold values. Mask R-CNN and Faster R-CNN perform equivalently. A likely reason is that, in document images, the masks for graphical objects are rectangular and hence do not add much information compared to bounding boxes. Figure 8 shows the detection results as true positives and false positives for the baseline models in Table 6.
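The COCO-style mAP reported in Table 6 averages the AP over ten IoU thresholds, from 0.5 to 0.95 in steps of 0.05. The sketch below shows only this averaging step. The per-threshold AP values are hypothetical placeholders, not results from this work, and the function names are assumptions of this example.

```python
def coco_iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_ap(ap_per_threshold):
    """Average AP over the COCO IoU thresholds.

    ap_per_threshold maps each IoU threshold to the AP measured at it.
    """
    thresholds = coco_iou_thresholds()
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)


# Hypothetical AP values that fall as the IoU threshold tightens.
example_ap = {t: round(1.0 - 0.5 * t, 3) for t in coco_iou_thresholds()}
example_map = mean_ap(example_ap)
ap50, ap75 = example_ap[0.5], example_ap[0.75]
```

Reporting AP at 0.5 and 0.75 alongside the averaged value, as in Table 6, separates coarse localization quality from strict localization quality.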

Analyzing Effects of Strong Backbones
Using strong backbones for object detection in natural images generally increases detection accuracy. Deep backbones can extract fine features and patterns. These fine features and patterns then help construct a high-level semantic representation of the data by the network. These semantic representations play a crucial role in object detection in natural images.
Nevertheless, the extraction of high-level semantic features is highly dependent on the depth of the network. Increasing the depth of the network while tackling the accuracy degradation problem is not straightforward. To achieve this, many novel general-purpose deep backbones such as ResNet [47], and even backbones originating from natural language processing (NLP) such as Swin-Transformers, have been proposed. More recently, ensemble methods that combine two or more general-purpose backbones, such as CBNet [38] and CBNetv2 [14], have been proposed.
The effects of strong backbones in the document image domain are investigated by applying Dual-backbone Swin-Transformer and Dual-backbone ResNet50 backbones with different detection heads such as Faster R-CNN, Mask R-CNN, and Cascade Mask R-CNN. Table 7 shows the different combinations of backbones and detection heads explored. The core idea behind CBNetv2 is that using two identical backbones can lead to improved performance, as the two backbones learn different features and aid in the detection of objects. Tables 6 and 7 show that the Dual-backbone version of Faster R-CNN performs only marginally better than the baseline version. This may be because document images are simpler in context than natural images, so the two parallel and identical backbones do not learn different features and hence do not aid each other significantly in detection.
As seen in Table 7, the Dual-backbone architecture of CBNetv2 with different Swin-Transformers as backbones is also explored. With both Mask R-CNN and Cascade Mask R-CNN as detection heads, the results are considerably worse than the baselines. A possible explanation is that the fine features and patterns extracted by the strong backbones end up confusing the detection heads of the network. This suggests that, in the document domain, where the images are much less contextually rich than natural images, the naive use of deep and strong backbones degrades performance. Figure 9 shows the detection results for the models in Table 7.

Learnable Proposals (Sparse R-CNN)
The subsection Analyzing Effects of Strong Backbones in Section 4.4.1 empirically establishes that the naive use of deep and strong backbones in the document image domain does not lead to better detection results. Compared to natural images, document images are essentially sparse in contextual information. Most detection heads, such as Faster R-CNN, Mask R-CNN and Cascade Mask R-CNN, were designed for natural images.
The region proposal network (RPN) of these detection heads depends heavily on texture, color, and contrast information to predict the regions of interest (RoIs). Features such as texture, contrast, and color are abundant in natural images. However, document images lack these features; hence, these RPNs can struggle when applied to document images. Further, due to the sparse nature of document images, the application of detection heads that rely on dense object priors can lead to degraded performance. A better approach would be to make the object priors sparse and learnable. Sparse R-CNN achieves this. Table 8 shows the different ResNet backbones used along with the Sparse R-CNN detection head. The table also shows the Sparse R-CNN (SR-CNN) model. Even with the relatively shallow ResNet50 backbone, the Sparse R-CNN model outperforms the strongest baseline in Table 6.
With a deeper backbone such as ResNet101, the best results on the IIIT-AR-13K dataset are achieved. This may be because Sparse R-CNN employs learnable proposal boxes using the learnable proposal features in the dynamic head of the Sparse R-CNN detection head. At the same time, the deep backbone extracts high-level features relevant for detection.
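To give a sense of scale for the contrast between dense and sparse object priors, the sketch below counts the anchor boxes a typical dense, feature-pyramid-based RPN would place on a 1200 × 800 input and compares that with a fixed set of learnable proposals. All settings are illustrative assumptions rather than values reported in this work: five pyramid levels with strides 4 to 64, three anchors per location, and 100 learnable proposals.

```python
import math


def count_dense_anchors(height, width,
                        strides=(4, 8, 16, 32, 64),
                        anchors_per_location=3):
    """Number of anchors tiled over all pyramid levels.

    Strides and anchors per location are assumed, typical settings.
    """
    total = 0
    for stride in strides:
        # One set of anchors at every cell of this level's feature map.
        cells = math.ceil(height / stride) * math.ceil(width / stride)
        total += cells * anchors_per_location
    return total


# Assumed proposal budget for the sparse, learnable paradigm.
NUM_LEARNABLE_PROPOSALS = 100

dense = count_dense_anchors(height=800, width=1200)
ratio = dense / NUM_LEARNABLE_PROPOSALS
```

Under these assumptions, the dense prior enumerates candidates at every feature-map location regardless of page content, whereas the sparse prior keeps a small set whose positions and features are learned during training.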
As seen in Table 8, the authors in [39] evaluate the new one-stage YOLOF method on the IIIT-AR-13K dataset. Their experimental YOLOF model achieves 0.588 mAP on this dataset. In comparison, the SR-CNN model achieves 35% higher raw mAP and considerably outperforms that model under the pre-training and fine-tuning paradigm.
Further, Table 9 shows the effects of deep and robust backbones on the SR-CNN model. To this end, the Sparse R-CNN head is paired with the Swin-Transformer general-purpose backbone. The results saturate with the Swin-Transformer Base backbone, which is pre-trained on the ImageNet-1k dataset. Using more robust backbones trained on the ImageNet-22k dataset, such as Swin-Transformer Base and Swin-Transformer Large, results in degradation of the mean average precision (mAP). Figure 10 shows the graph of the F1 score vs. IoU thresholds for the SR-CNN model on the validation and test splits of the IIIT-AR-13K dataset. From the figure, it is evident that the SR-CNN model performs well even at higher IoU thresholds for both the validation and test splits of the dataset.
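A curve of F1 score against IoU threshold, such as the one in Figure 10, can be produced by matching predictions to ground truth at each threshold and recomputing the counts. The sketch below uses a simple greedy one-to-one matching for a single class on a single page. This matching rule, the function names, and the example boxes are assumptions of this illustration, not the evaluation code used in this work.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_counts(preds, gts, iou_thr):
    """Greedy one-to-one matching; returns (tp, fp, fn).

    preds are assumed to be sorted by descending confidence.
    """
    matched = set()
    tp = 0
    for pred in preds:
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            iou = box_iou(pred, gt)
            if iou > best_iou:
                best_j, best_iou = j, iou
        # A prediction counts as correct only if its best match clears the threshold.
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    return tp, len(preds) - tp, len(gts) - tp


def f1_curve(preds, gts, thresholds):
    """List of (threshold, F1) pairs."""
    curve = []
    for thr in thresholds:
        tp, fp, fn = match_counts(preds, gts, thr)
        denom = 2 * tp + fp + fn
        curve.append((thr, 2 * tp / denom if denom else 0.0))
    return curve


# Hypothetical page: one exact detection and one slightly short box.
gts = [(0, 0, 100, 100), (200, 200, 300, 300)]
preds = [(0, 0, 100, 100), (200, 200, 300, 280)]
curve = f1_curve(preds, gts, [0.5, 0.9])
```

In this example, the second prediction overlaps its target with an IoU of 0.8, so it counts as correct at a threshold of 0.5 but becomes both a false positive and a missed object at 0.9.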

PubLayNet
Motivated by the results on the IIIT-AR-13K dataset, the SR-CNN model is applied to the PubLayNet [44] dataset. Table 10 shows the results. To the best of our knowledge, the state-of-the-art results on the PubLayNet dataset are achieved by CDeC-Net [37]. The CDeC-Net model has a Dual-ResNeXt101 backbone with deformable convolutions, and its detection head is a Cascade Mask R-CNN. The model achieves a mean average precision (mAP) of 0.96 [37]. The SR-CNN model achieves an mAP of 0.93. It achieves these results without using memory-intensive deformable convolutions or a strong high-resolution backbone such as ResNeXt101. The results are close to the current state-of-the-art achieved by the CDeC-Net model. At the same time, the SR-CNN model is simpler than CDeC-Net and utilizes readily available COCO weights. Figure 11 shows the detection results for the SR-CNN model on the PubLayNet dataset. Figure 12 shows the F1-score vs. IoU thresholds for the SR-CNN model.

Cross-Dataset Evaluation
Deep learning methods for object detection in document images are preferred over rule-based methods because they generalize better over different and distinct datasets. To investigate its generalization capabilities, the SR-CNN model is evaluated over three datasets: IIIT-AR-13k, PubLayNet, and DocBank. Only the classes common among the datasets are utilized for the cross-dataset evaluation. Baseline Faster R-CNN is also evaluated for comparison. The results are presented in Table 11. Table 11 shows that the SR-CNN model gives a mean average precision of 0.56 and 0.45 on PubLayNet and DocBank, respectively, when trained on the IIIT-AR-13K dataset. This may be because IIIT-AR-13K has a diverse collection of real-world images of company reports, which allows the model to learn a more general representation of document images and to perform reasonably well on the much simpler datasets of document images from academic articles.
When the SR-CNN model is trained on PubLayNet, it gives a mean average precision of 0.510 and 0.323 on the DocBank and IIIT-AR-13K datasets, respectively. The model's performance suffers on the IIIT-AR-13K dataset because the model is trained on a dataset that contains primarily academic and research articles, whereas IIIT-AR-13K offers an entirely different distribution of document images. However, when tested on the DocBank dataset, which consists mainly of document images from the academic domain, better performance is observed than for IIIT-AR-13k.
Finally, when the model is trained on DocBank, it gives a mean average precision of 0.184 and 0.735 on IIIT-AR-13k and PubLayNet, respectively. Hence, it can be inferred that the distributions of document images in the DocBank and PubLayNet datasets are similar. Therefore, the model achieves relatively good results when trained on one of these datasets and tested on the other.
Table 11. The cross-dataset evaluation results for the SR-CNN and baseline Faster R-CNN models. We evaluate the models on the common classes of IIIT-AR-13k, PubLayNet, and DocBank.
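The restriction to common classes described above can be sketched as a set intersection followed by a filter over the annotations. The class lists below are transcribed from the dataset descriptions in this paper; the DocBank entry contains only the subset of its 13 classes listed in Table 5. The lower-case label names, the annotation dictionary layout, and the function names are assumptions of this example.

```python
# Class lists as described in the dataset section; the DocBank list is
# only the subset shown in Table 5, and the label spelling is assumed.
DATASET_CLASSES = {
    "IIIT-AR-13k": {"table", "figure", "natural image", "logo", "signature"},
    "PubLayNet": {"text", "title", "list", "table", "figure"},
    "DocBank": {"abstract", "equation", "figure", "list",
                "paragraph", "section", "table"},
}


def common_classes(*dataset_names):
    """Sorted labels shared by every named dataset."""
    shared = set.intersection(*(DATASET_CLASSES[n] for n in dataset_names))
    return sorted(shared)


def filter_annotations(annotations, keep):
    """Keep only annotations whose label is in the shared set."""
    keep = set(keep)
    return [ann for ann in annotations if ann["label"] in keep]


shared = common_classes("IIIT-AR-13k", "PubLayNet", "DocBank")

# Hypothetical annotations for one page.
page = [
    {"label": "table", "bbox": (50, 60, 400, 300)},
    {"label": "logo", "bbox": (10, 10, 40, 40)},
]
kept = filter_annotations(page, shared)
```

Filtering both the training and the test annotations with the same shared set keeps the label space identical on both sides of a train-on-one, test-on-another experiment.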

Conclusions and Future Work
This paper investigates the effectiveness of deep and robust backbones in the document image domain. Further, it also explores the idea of learnable object proposals through Sparse R-CNN. To this end, the Analyzing Effects of Strong Backbones subsection in Section 4.4.1 shows that naively throwing the best and strongest deep learning backbones at the object detection problem in documents will not improve results. The same is shown by experimenting with strong backbones such as Swin-Transformers and an ensemble method of combining two or more strong backbones via the CBNetV2 framework. These backbones are paired with various detection heads, such as Faster R-CNN [13], Mask R-CNN [30], and Cascade Mask R-CNN, and the degradation in results is evident in Table 7.
Turning to the detection head, we argue that it would be better if the object priors in the detection head were sparse and learnable. The Sparse R-CNN [15] detection head achieves this, and we use it to formulate the SR-CNN model. The Learnable Proposals (Sparse R-CNN) subsection substantiates this argument through experiments on the IIIT-AR-13k dataset and demonstrates that sparse and learnable proposals in the document image domain can lead to better performance. Table 9 also shows that the performance of Sparse R-CNN suffers when paired with a more robust and deeper backbone. This supports our claim that the naive use of strong backbones is not well suited to document images.
Inspired by this insight, in Section 4.4.2 the SR-CNN model is trained on the PubLayNet dataset and achieves a mean average precision of 0.936. To the best of our knowledge, this is close to the current state-of-the-art of 0.96 (mAP) achieved by CDeC-Net [37]. The SR-CNN model is more computationally efficient (six times fewer GFLOPs than CDeC-Net) and uses the readily available COCO pre-trained weights. Table 1 compares the key characteristics of the explored approaches, along with their advantages and limitations, to facilitate understanding. In future work, we plan to extend this idea to the more challenging domain of document structure recognition in document images. There is also the possibility of exploring the role of different attention mechanisms in the sparse and learnable proposal paradigm.