Toward Semi-Supervised Graphical Object Detection in Document Images

Abstract: Graphical page object detection classifies and localizes objects such as tables and figures in a document. As deep learning techniques for object detection become increasingly successful, many supervised deep neural network-based methods have been introduced to recognize graphical objects in documents. However, these models require a substantial amount of labeled data for training. To address this limitation, this paper presents an end-to-end semi-supervised framework for graphical object detection in scanned document images. Our method is based on the recently proposed Soft Teacher mechanism, and we examine the effect of small percentages of labeled data on the classification and localization of graphical objects. On both the PubLayNet and the IIIT-AR-13K datasets, the proposed approach outperforms the supervised models by a significant margin at all labeling ratios (1%, 5%, and 10%). Furthermore, the 10% PubLayNet Soft Teacher model improves the average precision of Table, Figure, and List by +5.4, +1.2, and +3.2 points, respectively, with a total mAP similar to that of the Faster-RCNN baseline. Moreover, our model trained on 10% of the IIIT-AR-13K labeled data beats the previous fully supervised method by +4.5 points.


Introduction
Visual summary is an important aspect in a variety of applications, including summarizing the contents of a document and detecting graphical elements in the visualization pipeline. As a result, identifying and localizing graphical items will be an important step in document summary and analysis. With the increase in documents, it has become impractical to manually extract visual objects. Automated methods provide reliable, effective and efficient solutions for manual tasks. For instance, ref. [1] verified that the machine learning method performs better than humans in domain knowledge and attention-demanding tasks. Similarly, several automated methods [2][3][4] have been proposed to identify graphical objects, but these automated methods are typically rule-based, because the documents lack established dimension or structure [5].
Graphical page object detection aims at localizing and classifying multiple objects such as tables, images, and figures in a document. For instance, Figures 1 and 2 show objects detected and localized in the PubLayNet and IIIT-AR-13K datasets, respectively. As opposed to objects in natural images and scenes, graphical objects differ very little in appearance; for example, Figure 1 (right) contains a text block and a list block that appear similar but must be classified separately, which makes graphical page object detection more challenging. Although defining rules for all graphical objects and running an optical character recognition tool helps localize objects, such rules cannot generalize to new objects. This problem has led to the use of deep learning models to detect graphical objects [6][7][8][9]. Deep learning models are rule independent and overcome the generalization problem with high precision. However, training a deep learning model requires a large amount of annotated data, so either a manual labeling process or another pre-processing technique must be involved in generating high-quality data, which is both time-consuming and error-prone [10]. Owing to these concerns with training data, we move from a supervised to a semi-supervised setting. This paper's primary goal is to leverage unlabeled data in identifying graphical objects without compromising the mean average precision or overall performance. We leverage the recently proposed Soft Teacher mechanism [10] to design a semi-supervised pipeline for graphical object detection on scanned document images.
The semi-supervised deep learning techniques in object detection [11][12][13][14] follow a predefined pipeline where an initial detector is trained using a small amount of available annotated data, and this detector is used to generate pseudo labels for unlabeled data. Finally, the labeled data along with unlabeled data with pseudo labels are used to retrain the model. The quality of these pseudo-labeling approaches depends on the initial model, and a weak initial model may degrade the pseudo-labeling process.
To alleviate the weak initial model problem, the end-to-end semi-supervised framework simultaneously performs pseudo labeling for unlabeled data and trains a detector using these pseudo labels with few annotated data in each iteration. The approach trains two models, one for detection known as Student and the other for pseudo labeling known as Teacher. In addition to this, the teacher model is simply an Exponential Moving Average (EMA) of the student model; this ensures that the pseudo-labeling process is constantly updated by the detection process and vice versa. Therefore, both the models reinforce each other.
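The EMA coupling between the two models can be sketched as follows. This is a minimal illustration rather than the authors' implementation: model parameters are represented as a plain dictionary of floats, and the function name `ema_update` and the momentum values are assumptions chosen for the example.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float],
               student: Dict[str, float],
               momentum: float = 0.999) -> Dict[str, float]:
    """Return new teacher weights as an exponential moving average of the student.

    teacher_new = momentum * teacher_old + (1 - momentum) * student
    The default momentum is an assumption for illustration only.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy example with a single scalar "weight" and an exaggerated momentum of 0.5.
teacher = {"w": 0.0}
student = {"w": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
print(teacher["w"])  # 0.0 -> 0.5 -> 0.75 -> 0.875
```

Because the teacher only changes through this average, it moves slowly toward the student, which is what lets it serve as a stable source of pseudo labels while the student is still changing.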
The problem of graphical page object detection (GOD) can be formulated as a generic object detection problem in which we treat document images as natural images and graphical page objects, such as figures and tables, as natural objects. Therefore, inspired by the idea of [10], we apply the end-to-end semi-supervised framework to GOD. Unlike the traditional pseudo-labeling approaches, which are multi-staged rather than end-to-end, the Soft Teacher completes the whole cycle, from pseudo labeling to computing the loss on both labeled and pseudo-labeled data, within a single training iteration. Additionally, the framework provides a reinforcement effect, ensuring that the student model is always monitored by the teacher. Furthermore, this paper empirically shows that semi-supervised GOD methods can produce results comparable to those of fully supervised GOD approaches.
The rest of the paper is structured as follows. Section 2 introduces multiple ways for object detection and categorizes them into rule-based, learning-based, deep learning and semi-supervised methods. The novel Soft Teacher framework is described in Section 3, which is followed by a dataset introduction in Section 4.1, evaluation protocol in Section 4.2, implementation details in Section 4.3 and results in Section 4.4. Finally, Section 5 concludes the paper with some future directions.

Related Work
There are numerous techniques for localizing objects in an image, ranging from conventional OCR rule-based systems to more current and accurate deep learning methods. Although each method aims to tackle the same problem, they confront a few issues with data, procedure, and performance. This section discusses the object detection methods, a few disadvantages, and a semi-supervised deep learning-based solution to overcome the problems.

Rule-Based Methods
Before the success of deep learning techniques, rule-based solutions for graphical object recognition began with the requirement to recognize tables and figures in scanned text. Various rule-based algorithms were developed to localize tables. They identify graphical objects in a document by using predefined rules. Ref. [2] uses the spectator mining technique to detect potential objects in a PDF document. This method assumes that white spaces or lines can differentiate a table cell.
To tackle the problems with varying table layouts, ref. [15] proposes a practical algorithm that can detect tables from various sources, including newspapers and company articles. Unlike most of the rule-based algorithms, which focus on detecting objects from structured documents, ref. [3] employs a correlation-based approach along with dynamic programming to identify tables in noisy handwritten documents. The fundamental problem of rule-based techniques is that they rely on predefined structure and content arrangement in documents.

Learning-Based Methods
The supervised learning models are the initial and most successful approaches to solving the graphical page object detection. With the popularity of machine learning in object detection, Ref. [16] used an SVM model to identify table cells in a document by creating 26 low-level features for each group of intersecting horizontal and vertical lines. The classifier then determines whether a cell belongs to a table or not.
In [17], the authors verify that the documents can be described by the MXY tree, which is a hierarchical representation. The method identifies tables by locating perpendicular lines and white spaces. Ref. [18] describes a probabilistic graphical model for document analysis. This model incorporates different document structures into Hidden Markov Models. Unsupervised learning algorithms are also proposed to cluster tables and text separately. The domain-independent technique proposed by [19] applies bottom-up clustering to word segments in recognizing tables. Although the learning-based models are accurate and independent compared to rule-based models, they do not generalize to new structures and need a large amount of training data.

Deep Learning Methods
With the success of Convolutional Neural Networks (CNNs) in object detection and computer vision, CNNs have also been applied to detect and localize objects in documents. Ref. [20] proposes a method that first identifies table-like areas using a rule-based approach and then applies a CNN to the output to detect potential tables. This method also considers non-visual features such as characters for better localization. To eliminate the dependency on rules and other metadata in training, Ref. [21] introduced DeepDeSRT, a two-stage approach for table detection and structure recognition, which was later shown to generalize to new structures. A Deep Convolutional Neural Network (DCNN), which adds fully connected conditional random fields to convolution layers, performs multi-scale reasoning on visual cues for localizing objects. Hashmi et al. [22] made a few modifications to Cascade Mask R-CNN to identify mathematical formulas in document images.
The state-of-the-art object detection networks that depend on the regional proposals are introduced to the field of graphical object detection by [7]. They applied a Region Proposal Network (RPN), which shares full-image convolution features, enabling cost-free region proposals. There are many [23][24][25] region, pixel and connected component-based models proposed to identify textual and non-textual components in a document. The Faster-RCNN and Mask-RCNN detectors are used by [26] to localize and segment objects.
Based on a dynamic programming approach, [27] identifies region proposals in page object detection, and [28] outlines and summarizes multiple deep learning approaches for graphical page object detection. The authors of [29] conduct a performance analysis of neural networks that recognize tables in document images. Deep learning-based approaches are highly effective, and some detectors trained on a particular dataset are capable of detecting and localizing graphical structures on unseen datasets [30]. The major disadvantage of deep learning-based methods is their hunger for annotated data. These approaches require a large amount of annotated data that is hard to obtain, and the annotation process is manual, error-prone, and time-consuming.

Semi-Supervised Approaches
Due to the bottleneck in the labeling annotation process, the object detection techniques are leveraging unlabeled data via semi-supervised methods. Every day, numerous new documents are created that may or may not be related to current datasets and may represent new datasets. One of the early studies [31] employs semi-supervised learning on fully convolution networks to accomplish MS Lesion Segmentation. Attention-based semi-supervised deep networks proposed by [32] use region-attention to leverage unlabeled training data to segment medical images in an end-to-end fashion. Similarly, Ref. [33] exploits unlabeled endoscopic videos to learn representations of the target domain. The role of semi-supervised deep learning is employed in various fields such as label propagation [34,35], anomaly detection [36], and segmentation [37,38].
In image classification, consistency-based semi-supervised methods improve classification performance by using all the unlabeled input. These methods impose a consistency requirement: multiple modifications of the same image should produce comparable predictions. Ref. [39] modifies the model, Ref. [40] proposes a new regularization method based on virtual adversarial loss, and [41] uses randomized image augmentations. Self-training approaches, also known as pseudo-labeling approaches, such as [42,43] train an initial classification model on labeled data and use it to generate pseudo labels for unlabeled data, which in turn help refine the initial classifier.
Semi-supervised object detection algorithms, like their image classification counterparts, can be consistency based [39,44] or pseudo-label based [11,45,46]. Some of the semi-supervised approaches have been shown to perform better object detection than supervised approaches on the MS COCO image dataset [11,12,14,47,48]. These approaches are multi-staged: an initial detector is trained on labeled data and then used to generate pseudo labels for the unlabeled data. Finally, the detector is retrained on the combination of annotated and pseudo-labeled data. To improve on these multi-stage detectors, an end-to-end pseudo-label approach is proposed by [10], which simultaneously generates pseudo boxes and performs detection training. The application of this model to the MS COCO dataset to detect and localize objects showed significant improvement over the previous supervised and semi-supervised approaches. Hence, the primary objective of this paper is to leverage this novel end-to-end semi-supervised framework on graphical page object detection.

Proposed Framework
Semi-supervised object detection algorithms are broadly divided into consistency-based and pseudo-label-based methods. The pseudo-label-based algorithms are typically multi-staged and run in two stages. In the first stage, an initial classifier is trained using the labeled data. In the next stage, the initial classifier is used to create pseudo labels, and it is then updated with the combination of labeled and pseudo-labeled data. The Soft Teacher [10] framework, which has been shown to produce supervised-like results in generic object detection, is used here to show the same for graphical page object detection. It is a pseudo-labeling framework with a few modifications to the training process. Firstly, the framework is made end-to-end, without any separately trained initial classifier, and weak augmentation is used when generating pseudo labels. Secondly, to address the issue of scarce training data, the strongly augmented unlabeled data along with the labeled data are used for detection training.
The framework comprises two models: the "student" model, which is in charge of detection training, and the "teacher" model, which creates pseudo boxes for unlabeled input. The teacher model is the student model's exponential moving average (EMA) and generates pseudo labels on weakly augmented unlabeled data. In contrast, the student model is trained on both labeled and strongly augmented unlabeled data to minimize the loss. The teacher model generates two sets of pseudo labels: one for the classification task and the other for bounding box regression. We employ the FixMatch technique [49] of using weak and strong data augmentation for training the two branches.
During the training process shown in Figure 3, the training data are split into batches, each of which contains a random selection of labeled and unlabeled data. The teacher model creates pseudo labels from weakly augmented unlabeled data, while the student model uses both labeled and strongly augmented unlabeled data. The pseudo labels generated by the teacher model are used as ground truth for the strongly augmented unlabeled data to accomplish detection training. In contrast to the traditional classification problem, which determines whether a specific image corresponds to a specific label, the detection problem must identify and localize several entities in the same image. Identifying a particular object can be treated as a classification task, and the localization branch can be handled as a regression task. The framework's total loss is the weighted sum of the supervised and unsupervised losses:
$$L_{total} = L_{sup} + \alpha L_{unsup} \tag{1}$$
Here, $L_{sup}$ and $L_{unsup}$ represent the supervised and unsupervised losses, respectively, and $\alpha$ weights the unsupervised loss. If $N_{la}$ and $N_{un}$ represent the number of labeled and unlabeled images in a batch, with $I^{i}_{la}$ and $I^{i}_{un}$ representing the i-th labeled and unlabeled image, respectively, the supervised and unsupervised losses can be defined, following [10], as
$$L_{sup} = \frac{1}{N_{la}} \sum_{i=1}^{N_{la}} \left( L_{cls}(I^{i}_{la}) + L_{reg}(I^{i}_{la}) \right)$$
$$L_{unsup} = \frac{1}{N_{un}} \sum_{i=1}^{N_{un}} \left( L_{cls}(I^{i}_{un}) + L_{reg}(I^{i}_{un}) \right)$$
where $L_{cls}$ and $L_{reg}$ are the classification and box regression losses.
Both the teacher and student models are randomly initialized, and during the training, the teacher model, in addition to providing pseudo labels, guides the student model. This effect occurs because the teacher model is a simple EMA of the student. When the teacher model predicts on an unlabeled image, multiple candidate pseudo boxes are generated, and Non-Maximum Suppression (NMS) is used to discard redundant ones. However, there can still be multiple pseudo boxes after the suppression. To separate the remaining boxes into foreground and background, a threshold is employed, and the predictions with box scores higher than the threshold are used as foreground boxes in training.
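The pseudo-box selection described above (suppression followed by a score threshold) can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the helper names, the greedy NMS variant, and the NMS IoU threshold of 0.5 are assumptions, while the 0.9 default mirrors the foreground threshold reported in the implementation details.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thr: float) -> List[int]:
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


def select_pseudo_boxes(boxes: List[Box], scores: List[float],
                        iou_thr: float = 0.5, score_thr: float = 0.9) -> List[int]:
    """Apply NMS, then keep only boxes whose score exceeds the foreground threshold."""
    return [i for i in nms(boxes, scores, iou_thr) if scores[i] > score_thr]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (100, 100, 110, 110)]
scores = [0.95, 0.93, 0.97, 0.40]
print(select_pseudo_boxes(boxes, scores))  # [2, 0]
```

In the example, box 1 is suppressed because it overlaps box 0 (IoU of about 0.68), and box 3 survives NMS but falls below the foreground threshold, so it would be treated as background.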

Limitations
Although the above base framework is capable of handling unlabeled data for training, it has two problems. The first problem is due to the high threshold value for identifying the foreground boxes. Setting a high threshold value results in higher precision, but the recall drops. Hence, if the IoU between the student-generated boxes and the teacher-generated pseudo boxes is used for label assignment, many true foreground boxes will be wrongly assigned as background. The second problem is that localization accuracy and foreground score are not strongly correlated, so boxes selected by their foreground score may be poorly localized. These problems are solved in [10] by introducing two techniques, namely the Soft Teacher and Box Jittering, which handle the classification and regression losses of the unsupervised branch, respectively.

Soft Teacher
The teacher model and its pseudo boxes are leveraged to reduce the wrong assignment of foreground boxes to the background. The main idea of the Soft Teacher is to assess the reliability of each student-generated background box candidate being a real background, which is then used to weight the background loss. Let $\{b^{fg}_i\}$ and $\{b^{bg}_j\}$ denote the sets of foreground and background boxes; following [10], the unsupervised classification loss can be calculated as
$$L^{cls}_{unsup} = \frac{1}{N^{fg}} \sum_{i=1}^{N^{fg}} L_{cls}(b^{fg}_i, G_{cls}) + \sum_{j=1}^{N^{bg}} w_j L_{cls}(b^{bg}_j, G_{cls}), \qquad w_j = \frac{r_j}{\sum_{k=1}^{N^{bg}} r_k}$$
Here, $N^{fg}$ and $N^{bg}$ denote the number of boxes in $\{b^{fg}_i\}$ and $\{b^{bg}_j\}$, and $G_{cls}$ denotes the set of classification pseudo labels generated by the teacher model. $L_{cls}$ is the classification loss, and $r_j$ denotes the reliability score of the j-th background box. Ref. [10] showed that the background score produced by the teacher model serves well as the reliability score. Figure 4 shows the classification bounding boxes that are generated during the training process.
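The reliability weighting can be illustrated with a small sketch. It assumes, as in the Soft Teacher formulation of [10], that each background box is weighted by its reliability score normalized over all background boxes, and it uses cross-entropy as the classification loss; the function names and the toy probabilities are hypothetical.

```python
import math
from typing import List


def cross_entropy(prob_of_target: float) -> float:
    """Cross-entropy of one box, given the probability assigned to its target class."""
    return -math.log(prob_of_target)


def soft_teacher_cls_loss(fg_target_probs: List[float],
                          bg_target_probs: List[float],
                          bg_reliability: List[float]) -> float:
    """Foreground term averaged over boxes plus reliability-weighted background term."""
    fg_loss = 0.0
    if fg_target_probs:
        fg_loss = sum(cross_entropy(p) for p in fg_target_probs) / len(fg_target_probs)

    total_reliability = sum(bg_reliability)
    bg_loss = 0.0
    if total_reliability > 0:
        # Each weight is r_j divided by the sum of all reliabilities, so weights sum to 1.
        bg_loss = sum((r / total_reliability) * cross_entropy(p)
                      for r, p in zip(bg_reliability, bg_target_probs))
    return fg_loss + bg_loss


# One perfectly classified foreground box, two background boxes with
# probability 0.5 for "background" and different reliabilities.
loss = soft_teacher_cls_loss([1.0], [0.5, 0.5], [1.0, 3.0])
print(loss)  # equals ln(2), since the normalized weights sum to 1
```

A background candidate with low reliability (one the teacher itself suspects to be foreground) contributes little to the loss, which limits the damage done by foreground boxes that were wrongly assigned to the background.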

Box Jittering
Because foreground boxes are selected by a high threshold on the foreground score, which is not strongly correlated with localization accuracy, the selected boxes are not necessarily accurate. To alleviate the issue, ref. [10] proposes a box-jittering approach in which a jittered box is sampled around a foreground box candidate $b_i$ and refined by the teacher model. This is repeated $N_{jitter}$ times, so that $N_{jitter}$ refined boxes $\{b^{*}_{i,j}\}$ are generated. For each candidate, a box regression variance is calculated to identify the boxes with high localization accuracy. Following [10], the box regression variance is defined by
$$\bar{\sigma}_i = \frac{1}{4} \sum_{k=1}^{4} \sigma^{*}_k, \qquad \sigma^{*}_k = \frac{\sigma_k}{0.5 \left( h(b_i) + w(b_i) \right)}$$
Here, $\sigma_k$ denotes the standard deviation of the k-th coordinate of the refined box set $\{b^{*}_{i,j}\}$, $\sigma^{*}_k$ is the normalized standard deviation, and $h(b_i)$ and $w(b_i)$ are the height and width of the box candidate $b_i$. A smaller box regression variance $\bar{\sigma}_i$ indicates better localization, but calculating the box regression variance for all the foreground boxes is highly time- and resource-consuming. Hence, it is calculated only for boxes with a foreground score greater than 0.5. Finally, if $G_{reg}$ denotes the regression pseudo labels generated by the teacher model, the unsupervised regression loss can be calculated using
$$L^{reg}_{unsup} = \frac{1}{N^{fg}} \sum_{i=1}^{N^{fg}} L_{reg}(b^{fg}_i, G_{reg})$$
where $N^{fg}$ is the number of foreground boxes and $L_{reg}$ is the box regression loss. Figure 5 shows the regression bounding boxes that are generated after the box jittering approach in the training process.
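As a hedged illustration of the box regression variance, the sketch below takes a box candidate and a list of already refined boxes and returns the mean of the four normalized coordinate standard deviations. The jitter-and-refine step itself requires the teacher model and is not shown; the function names are hypothetical, the use of the sample standard deviation is an assumption, and the 0.02 default mirrors the threshold reported in the implementation details.

```python
import statistics
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_regression_variance(candidate: Box, refined: List[Box]) -> float:
    """Mean of the four coordinate standard deviations, normalized by box size."""
    height = candidate[3] - candidate[1]
    width = candidate[2] - candidate[0]
    scale = 0.5 * (height + width)
    normalized = [
        # Sample standard deviation of the k-th coordinate (assumption).
        statistics.stdev([box[k] for box in refined]) / scale
        for k in range(4)
    ]
    return sum(normalized) / 4.0


def is_reliable(candidate: Box, refined: List[Box], threshold: float = 0.02) -> bool:
    """Keep the candidate as a regression pseudo label if its variance is small."""
    return box_regression_variance(candidate, refined) < threshold


candidate = (0.0, 0.0, 10.0, 10.0)
stable = [(0.0, 0.0, 10.0, 10.0)] * 3               # refinements agree exactly
unstable = [(0.0, 0.0, 10.0, 10.0), (2.0, 0.0, 10.0, 10.0)]  # x1 moves by 2 pixels
print(is_reliable(candidate, stable))    # True
print(is_reliable(candidate, unstable))  # False
```

Normalizing by the box size makes the measure scale independent: a two-pixel disagreement matters far more for a small figure than for a full-page table.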

PubLayNet
The PubLayNet dataset is created by matching the XML representations and content of over 1 million publicly available documents and is primarily used for training the model. This dataset consists of 335,703 training images and 11,245 validation images. Table 1 shows the distribution of labels in the training dataset. In addition, a new subset dataset termed sub-PubLayNet is created, which consists of two labels: Figure and Table. These data are used to cross-validate a PubLayNet-trained model on the sub-DocBank and sub-IIIT-AR-13K datasets.
Table 1. Label-wise summary of PubLayNet training dataset.

IIIT-AR-13K
The IIIT-AR-13K dataset consists of a set of business documents, mostly annual reports. The dataset contains 9333 training images and 1955 validation images. Table 2 shows the class-wise training dataset distribution. Similar to PubLayNet, a subset dataset (sub-IIIT-AR-13K) is created with two labels: Figure and Table. This subset is used to train a Soft Teacher model and to validate the model on the sub-PubLayNet and sub-DocBank datasets.
Table 2. Label-wise summary of IIIT-AR-13K training dataset.

DocBank
The DocBank dataset consists of 16 classes. To understand the cross-validated performance of the DocBank-trained model on other datasets, a new subset (sub-DocBank) is created, which contains the labels Table and Figure. Tables 3 and 4 show the total image and annotation counts in the subset datasets sub-PubLayNet, sub-IIIT-AR-13K, and sub-DocBank.
Table 3. Actual datasets are used to create subset datasets that only consist of the Table and Figure annotations. The table represents the distribution of the Table and

Evaluation Protocol
The partially labeled data creation follows the STAC [52] setting, where 1%, 5%, and 10% of the training data are sampled as annotated data in three separate training settings. The rest of the data in each setting are used as unlabeled data in the training process. For each protocol, STAC provides five different folds, and the final performance is the average over all five folds. To compare the supervised models with this semi-supervised framework, the supervised models are trained on the labeled portion only, since they cannot use the unlabeled data. The mean Average Precision (mAP) on the validation data is used to compare the various models.
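The protocol can be sketched as follows. This is only an illustration of the idea, not STAC's actual sampling code: the function names are hypothetical, and using the fold index as the random seed is an assumption made so that each fold draws a different labeled subset.

```python
import random
from typing import List, Sequence, Tuple


def split_labeled_unlabeled(image_ids: Sequence[int],
                            labeled_fraction: float,
                            seed: int) -> Tuple[List[int], List[int]]:
    """Randomly mark a fraction of the training images as labeled; the rest are unlabeled."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_labeled = max(1, round(len(ids) * labeled_fraction))
    return sorted(ids[:n_labeled]), sorted(ids[n_labeled:])


def mean_over_folds(fold_maps: Sequence[float]) -> float:
    """Final score for one protocol: the average mAP over its folds."""
    return sum(fold_maps) / len(fold_maps)


# Five folds of the 5% protocol on a toy dataset of 1000 image ids.
folds = [split_labeled_unlabeled(range(1000), 0.05, seed=fold) for fold in range(5)]
labeled, unlabeled = folds[0]
print(len(labeled), len(unlabeled))  # 50 950
```

Each fold would then be used to train one model, and the reported number for the protocol would be `mean_over_folds` applied to the five resulting mAP values.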

Precision
The Precision [53] is the fraction of relevant instances (True Positives) among the retrieved instances (True Positives + False Positives).

Recall
The Recall [53] is the fraction of relevant instances (True Positives + False Negatives) that were retrieved (True Positives).

F1-Score
The F1-score [53] is the harmonic mean of the Precision and Recall. It is mathematically defined as follows:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
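The three metrics above can be computed from detection counts as in the short sketch below. The function names and the example counts are illustrative, and returning 0.0 for an empty denominator is a convention chosen here rather than something stated in the text.

```python
def precision(tp: int, fp: int) -> float:
    """True Positives divided by all retrieved instances (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """True Positives divided by all relevant instances (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if (p + r) > 0 else 0.0


# Example: 8 correct detections, 2 spurious detections, 8 missed objects.
p = precision(tp=8, fp=2)   # 8 / 10
r = recall(tp=8, fn=8)      # 8 / 16
print(p, r, round(f1_score(p, r), 4))  # 0.8 0.5 0.6154
```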

Intersection over Union (IoU)
The Intersection over Union (IoU) [22], which is also known as the Jaccard index, measures the similarity between finite sample sets. It is defined as the ratio between the size of the intersection and the size of the union of the two sample sets. In object detection, it measures the overlap between the predicted region and the ground truth region.
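As a worked example, consider two hypothetical axis-aligned boxes given as (x1, y1, x2, y2): a prediction A = (0, 0, 10, 10) and a ground truth B = (5, 5, 15, 15). The box coordinates are made up for illustration; each box has area 100, and the boxes overlap in a 5 by 5 square.

```latex
\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
                   = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
\]
\[
|A \cap B| = 5 \times 5 = 25, \qquad
|A \cup B| = 100 + 100 - 25 = 175, \qquad
\mathrm{IoU}(A, B) = \frac{25}{175} = \frac{1}{7} \approx 0.143
\]
```

With an IoU threshold of 0.5, this prediction would not count as a match for the ground truth box, even though a quarter of each box lies inside the other.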

Average Precision (AP)
Average Precision is defined as the area under the Precision-Recall curve. In the context of object localization, the average precision is defined for multiple IoU thresholds. For instance, AP@[0.5] indicates the area under the Precision-Recall curve when the IoU threshold is set to 0.5. In this case, a detection is counted as correct if its IoU with the ground truth is at least 0.5. In the context of MS COCO [48] evaluation, the Average Precision of the classes is averaged to get the final Average Precision.

Mean Average Precision (mAP)
The mean Average Precision (mAP) is calculated by taking the mean of the average precision of all classes over all the IoU thresholds defined. For MS COCO [48] evaluation, the mAP is averaged over 10 IoU thresholds from 0.50 to 0.95 with a step size of 0.05.
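The averaging step can be sketched as below, assuming the per-class, per-threshold AP values have already been computed from the Precision-Recall curves. The function names and the AP values in the example are hypothetical; a real evaluation would normally rely on the official COCO evaluation tooling.

```python
from typing import Dict, List


def coco_iou_thresholds() -> List[float]:
    """Ten IoU thresholds: 0.50, 0.55, ..., 0.95 (step size 0.05)."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def mean_average_precision(ap: Dict[str, Dict[float, float]]) -> float:
    """Average AP over all classes and all IoU thresholds.

    `ap[class_name][iou_threshold]` holds a precomputed AP value.
    """
    values = [ap[c][t] for c in ap for t in coco_iou_thresholds()]
    return sum(values) / len(values)


thresholds = coco_iou_thresholds()
# Toy AP values: constant per class, purely for illustration.
ap_table = {
    "Table": {t: 0.9 for t in thresholds},
    "Figure": {t: 0.7 for t in thresholds},
}
print(round(mean_average_precision(ap_table), 4))  # 0.8
```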

Implementation Details
For the PubLayNet partial data (i.e., based on the percentage of annotated data) and the PubLayNet intersection data (i.e., based on the percentage of annotated data for Table and Figure), the implementation uses Faster R-CNN [7] equipped with FPN [54] as the default detection framework for the evaluation. Along with the ImageNet pre-trained ResNet-50 [55] and ResNet-101 [55] backbones on 1%, 5% and 10% of annotated data, a Swin-T [56] model is also trained on 5% and 10% of annotated data for comparison. Anchors with five scales and three aspect ratios are used, and 2000 and 1000 region proposals are generated during training and inference, respectively, with a non-maximum suppression threshold of 0.7. Finally, in each training step, 512 proposals are sampled from the 2000 proposals as box candidates to train the RCNN.
All the models are trained for 150,000 iterations on two GPUs with a batch size of eight images per GPU. The foreground threshold is set to 0.9, and the data sampling ratio, which is the ratio of annotated to non-annotated images in each batch, is set to 0.2 and gradually decreases to 0 in the last 10,000 iterations. For the stochastic gradient descent, the learning rate is set to 0.01 and is divided by ten at 110,000 iterations. To identify the pseudo labels for bounding boxes with high localization reliability, $N_{jitter}$ is set to 10 with a box regression variance threshold of 0.02. In addition, the same augmentation techniques as in [52] are considered for training.
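The two schedules mentioned above can be written down as small functions. This is a sketch under stated assumptions: the function names are hypothetical, and the linear shape of the sampling-ratio decay is an assumption, since the text only states that the ratio gradually decreases to 0 over the last 10,000 iterations.

```python
def learning_rate(iteration: int, base_lr: float = 0.01,
                  drop_at: int = 110_000) -> float:
    """Step schedule: base_lr before drop_at, base_lr / 10 from drop_at onwards."""
    return base_lr if iteration < drop_at else base_lr / 10.0


def sampling_ratio(iteration: int, total_iters: int = 150_000,
                   decay_iters: int = 10_000, start: float = 0.2) -> float:
    """Data sampling ratio: constant, then decayed to 0 over the final iterations.

    The linear decay is an assumption made for this illustration.
    """
    decay_start = total_iters - decay_iters
    if iteration <= decay_start:
        return start
    remaining = max(0, total_iters - iteration)
    return start * remaining / decay_iters


for it in (0, 110_000, 140_000, 145_000, 150_000):
    print(it, learning_rate(it), sampling_ratio(it))
```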
The hyper-parameters foreground threshold, suppression threshold, and $N_{jitter}$ were found to be optimal by [10] when training the Soft Teacher model on the MS COCO object detection dataset [48]. Finally, since the labeled IIIT-AR-13K data in all the splits are small (fewer than 1100 images), the IIIT-AR-13K models are trained for only 50,000 iterations with Faster-RCNN equipped with FPN, using ImageNet pre-trained ResNet-50 as the backbone. The learning rate is set to 0.01, and the other parameters are the same as for the PubLayNet-trained models.

PubLayNet
In this section, the results of multiple models trained on the PubLayNet dataset with the proposed framework are compared with the models trained using supervised techniques. The partially annotated PubLayNet models are first compared with the corresponding supervised models. Table 5 shows the comparison of the Faster-RCNN supervised models on the ResNet-50 backbone with different versions of Soft Teacher that are trained using Faster-RCNN and Swin-T transformers. For the detailed comparison, the Faster-RCNN is further trained using two backbones, i.e., ResNet50 and ResNet101. The semi-supervised models showed around 4.9 points, 5.8 points and 5.5 points improvement in the bounding box mean Average Precision when compared under 1%, 5% and 10% of annotated data, respectively.
Further, a Soft Teacher (Faster-RCNN + ResNet101) model trained on 10% of the labeled data for 180,000 iterations with a batch size of 8 is compared with the existing baseline model of [26]. The baseline models are trained using fully labeled PubLayNet data on Faster-RCNN for 180,000 iterations. The results in Table 6 show that the Soft Teacher model outperforms the supervised model's average precision at IoU = 0.5 by 2.7% and exhibits a similar mean Average Precision of around 90.0%. The inference times of our models, in terms of FPS, are shown in Table 5. Although no reference FPS is available from the earlier methods, we compared the inference times of our own models and found that the model equipped with Faster R-CNN and ResNet50 outperformed those with ResNet101 and Swin-T in inference speed. In addition to the total mAP, the label-wise AP comparison is shown in Table 7. For the graphical objects Table, Figure and List, the Soft Teacher model outperforms the baseline model by 5.4%, 1.2% and 3.2%, respectively. Moreover, Table 8

IIIT-AR-13K
The supervised Faster RCNN and Soft Teacher models trained on IIIT-AR-13K with the ResNet50 backbone are compared under three different ratios of labeled data: 1%, 5%, and 10%. Table 9 shows that the proposed framework performs better than the Faster RCNN model by 6.4%, 1% and 5.1%, respectively. Additionally, a YOLO F [57] model trained on the complete data is compared with the Soft Teacher model that is trained on 10% labeled data. The Soft Teacher model showed a 3.7% mAP improvement, and it outperforms the YOLO F model in detecting natural images, logos, and signatures. Tables 10 and 11 show the detailed comparison of the YOLO F and Soft Teacher models along with the inference times. Figure 2 shows the qualitative results on two IIIT-AR-13K samples, and Figure 9 displays a few samples of True Positive and False Positive detections produced by the Soft Teacher model. Finally, Table 12 and Figures 10 and 11 present the Precision, Recall and F1 scores of a Soft Teacher model trained on 10% of the IIIT-AR-13K labeled dataset.

Cross-Validation
PubLayNet, IIIT-AR-13K, and DocBank are the most common datasets used for graphical object detection, and the three datasets contain object classes that differ from each other. To validate a model trained on one dataset on another dataset, the models need to be trained on the objects common to all the datasets. Table and Figure are the classes common to all three datasets, but the datasets themselves differ from each other; for instance, the IIIT-AR-13K dataset is based on business documents, PubLayNet is created from publicly available documents, and DocBank is created in a way to ease layout analysis. The primary idea is to identify how similar the datasets are in terms of the graphical objects. The new datasets sub-PubLayNet, sub-IIIT-AR-13K and sub-DocBank, which contain only the Table and Figure classes, are used for this purpose.
The results of models trained on sub-PubLayNet and cross-validated on sub-IIIT-AR-13K and sub-DocBank are shown in Table 13. The Soft Teacher model trained on 10% of the labeled training data with Faster-RCNN and ResNet101 showed the highest mAP of 95.1%. The validation performed on the other datasets indicates that sub-DocBank is more similar to sub-PubLayNet than sub-IIIT-AR-13K is. Table 14 also indicates that sub-DocBank is similar to sub-PubLayNet, and interestingly, the mAP on sub-PubLayNet is higher than that on sub-DocBank, even though the model is trained completely on sub-DocBank. Table 15 shows that the models trained on IIIT-AR-13K perform poorly on the other two datasets. This is because the dataset contains mostly business reports, unlike the other two datasets. Figure 12 depicts the qualitative evaluation of the table and figure classes on the sub-PubLayNet dataset by displaying samples of True Positives and False Positives.

[Table header residue: Methods, Labeled Data, and Table, Figure, and mAP columns for each of Sub-DocBank, Sub-PubLayNet, and Sub-IIIT-AR-13K; table values not recovered.]

Conclusions and Future Work
This paper presents a semi-supervised framework for graphical page object detection that employs an end-to-end semi-supervised technique known as Soft Teacher. In addition to operating on minimal labeled data, this approach combines the otherwise multi-stage process of pseudo labeling into a single pipeline. As the framework generates pseudo labels simultaneously with detection training, a flywheel effect occurs, which means one model constantly refines the pseudo boxes generated by the other model during the training process. This framework refines the classification and regression pseudo boxes using two different techniques, Soft Teacher and Box Jittering. These two processes work independently to render accurate classification and bounding box predictions.
This approach outperforms the supervised models at all labeling ratios (1%, 5%, and 10%) of the PubLayNet and IIIT-AR-13K training data. Additionally, the Soft Teacher models trained on 10% of the PubLayNet labeled data performed similarly to the existing supervised baseline, whereas the Soft Teacher model on IIIT-AR-13K outperformed the YOLO F model. Finally, new datasets are created to perform inter-dataset cross-validation, which shows that the PubLayNet and DocBank datasets are more closely related to each other than to IIIT-AR-13K.
In future work, we plan to use more powerful two-stage detectors such as Cascade-RCNN for better detection results compared to the current Faster-RCNN. Furthermore, we will examine the effect of the percentage of labeled data on the final performance and try to design robust models relying on minimal annotated data. The aforementioned results show that the proposed framework with Faster R-CNN, trained on partially labeled data, performs similarly to the fully supervised Faster R-CNN; nevertheless, we must show that this statement holds for multiple backbones. Finally, to establish a baseline for semi-supervised graphical page object detection, we want to assess various semi-supervised frameworks, including consistency-based and pseudo-labeling approaches.