Continual Learning for Table Detection in Document Images

Abstract: The growing amount of data demands methods that can gradually learn from new samples. However, continually training a network is not trivial: retraining it on new data usually triggers a phenomenon called "catastrophic forgetting", in which the model's performance on previous data drops as it learns from the new instances. This paper explores this issue for the table detection problem. While there are multiple datasets and sophisticated methods for table detection, the use of continual learning techniques in this domain has not been studied. We employ an effective technique called experience replay and perform extensive experiments on several datasets to investigate the effects of catastrophic forgetting. The results show that our proposed approach mitigates the performance drop by 15 percent. To the best of our knowledge, this is the first time continual learning techniques have been adopted for table detection, and we hope this work stands as a baseline for future research.


Introduction
Tables in documents present crucial content concisely and compactly. Since tables come in different layouts, even modern optical-character-recognition-based methods [1][2][3] cannot yield satisfactory performance when extracting information from them. Therefore, before retrieving tabular information, it is essential to identify a table's position in a document, a task mainly referred to as table detection. Table detection has been a challenging and open problem for the last couple of decades [4,5]. The fundamental obstacle is the confusion between tables and other graphical elements in a document that share a similar appearance [6]. Two factors mainly cause this confusion: first, the inherently high intra-class variance due to the many distinctive designs and scales of tables; second, the low inter-class variance due to their similar appearance to other graphical page objects (figures, algorithms, and formulas).
Earlier approaches addressed such issues by exploiting external meta-data in documents [7]. Unfortunately, the performance of these methods usually degrades on scanned document images, where such meta-data is unavailable. Recently, deep-learning-based techniques have given fresh life to the field [8]. We observe a direct association between advances in table detection and in object detection in computer vision [9][10][11][12]. The idea of formulating a table as an object and the document image as a natural scene has produced state-of-the-art results on several publicly available table detection datasets [13,14].
Although these modern deep-learning-based methods are impressive in many ways, they have drawbacks. For one, they are data-hungry and work best only when exposed to all possible samples; however, in real-life situations, data scarcity is the norm. Another issue is that artificial neural networks, unlike their natural counterpart (the human brain), are not necessarily able to add to their knowledge from the new data fed to them. The problem is catastrophic forgetting [15]: a degradation in the model's performance on prior cases when it is adapted to a new set of data. The primary reason for catastrophic forgetting is that learning necessitates changing the neural network's weights, and these changes erase prior knowledge. Many researchers have studied ways to learn from new data over time [16]. In the literature, different names are used to describe these techniques (e.g., lifelong learning, incremental learning, online learning), but here we refer to them as continual learning (CL). The idea of CL is to keep learning from incoming data while preserving the knowledge gained from previous training [17]. Figure 1 illustrates the fundamental difference between conventional approaches and our proposed method, which leverages continual learning: retraining a model on new data results in lower performance, whereas a continual learning method preserves prior knowledge while acquiring the new knowledge. With the advent of deep learning, large datasets are introduced for table detection every year [18,19]. To demonstrate state-of-the-art results, all prior methods require extensive training on the new datasets, hence overwriting previous knowledge. To address these limitations, we propose a novel end-to-end trainable table detection method that leverages a continual learning mechanism.
The method is robust to the inclusion of novel datasets and exhibits impressive performance on the previously trained data. To investigate the effectiveness of continual learning, we treat the identification of tables as an object detection problem and employ Faster R-CNN [9] and Sparse R-CNN [20] to detect tables in document images.
The primary contributions of this work are threefold:
• To the best of our knowledge, this work is the first attempt to incorporate a continual-learning-based method for table detection.
• Extensive experiments are conducted on considerably large datasets with more than 900 K images combined.
• The presented method reduces the forgetting effect by a 15 percent margin.
The rest of the paper is organized as follows: Section 2 reviews previous attempts at table localization, followed by an analysis of CL exploitation. Our proposed methodology is elaborated in Section 3. The results of the experiments that demonstrate the effect of CL are presented in Section 4, and Section 5 concludes the paper.

Related Work
Table detection has been an open problem for a few decades. This section discusses earlier rule-based methods, followed by recent deep learning approaches. Finally, we highlight current methods that have incorporated continual learning in various fields.

Rule-Based Table Detection
The incorporation of rule-based methods for table detection originates from the work of Itonori et al. [21], who proposed heuristics that capitalize on the positions of text blocks and ruling lines to detect tabular boundaries in documents. In a similar direction, ref. [22] exploits the ruling lines in tables to tell them apart from regular text. Pyreddy and Croft [4] presented another method based on custom heuristics that detects structural elements and filters the tables in a document. Similarly, specific tabular layouts and grammars have been designed to identify tabular structures [23,24]. For a comprehensive summary of rule-based methods, we refer readers to [7,[25][26][27][28].
Deep-Learning-Based Table Detection
Hao et al. [38] applied Convolutional Neural Networks (CNNs) to retrieve spatial features from a document and merged them with PDF meta-data to detect tables. Gilani et al. [10] formulated table detection as object detection and used a generic detector, Faster R-CNN [9], to detect tables in document images. In a similar line of work, ref. [29] incorporates deformable networks [39] to tackle varied tabular layouts. Saha et al. [40] exploited Mask R-CNN [41] to improve the overall performance of table, figure, and formula detection in document images. Cascade Mask R-CNN [42] has also been employed to enhance the performance of table detection with a conventional backbone network [13] and recursive feature pyramid networks [14]. The recently proposed Hybrid Task Cascade network [11] was merged with deformable convolutions to produce impressive table detection results [12].
In addition to object-detection-based methods, other works have leveraged Fully Convolutional Networks (FCNs) [43,44] and Graph Neural Networks (GNNs) [37,45] to address table detection in document images. It is essential to emphasize that all prior works in the field of table detection rely heavily on dataset-specific training to yield satisfactory results.

Applications of Continual Learning
Catastrophic forgetting has been tackled with various techniques in different domains [15,46]. In some works, a regularization technique is applied to parts of the model such as the loss function, learning rate, or optimizer; in others, a form of dynamic architecture or parameter isolation is practiced to learn different tasks continually; some form of rehearsal is also exercised [47]. For image classification, Kirkpatrick et al. [48] introduced elastic weight consolidation (EWC) to alleviate forgetting; in their approach, any modification to the important weights of the network is penalized. In [49], the authors present an EWC-based method for incremental object detection. According to their findings, when the annotations of the old classes are missing, EWC misclassifies previous categories as background; therefore, they proposed a pseudo-bounding process for annotating old classes on the new set of images.
Rebuffi et al. [50] proposed iCaRL for image classification. In iCaRL, a set of exemplars for each class is selected dynamically and used for replay together with a knowledge distillation technique [51]. In [52], the authors proposed deep model consolidation (DMC) for incremental learning, applicable to both image classification and object detection. In their approach, a double distillation loss combines two models, one trained on the old classes and the other on the new ones.
In [53], the authors developed a variant of Fast R-CNN [54] with a class-agnostic region proposal mechanism [55] for object detection in a class-incremental setting. A distillation loss was also used to reduce forgetting when training on new objects, with a frozen copy of the model learned on the previous set of classes. RODEO [56] applied the experience replay procedure using a buffer memory comprising compressed representations of past samples. Another recent work applying experience replay is presented by Shieh et al. [57]; in their method, images from the former task are concatenated with the new samples for class-incremental object detection.
Considering the performance and adaptability of the experience replay approach, we employ it for detecting tables in a continual manner. By addressing the absence of lifelong learning in table-detector networks, we expect that adopting newly emerging datasets will become easier for this step of the OCR pipeline. The following sections explain and examine the proposed solution.

Experimental Setup
The purpose of this study is to continually train a network on new data while preserving prior knowledge. In recent years, multiple datasets have been published for table detection, and the demand for more labeled data persists; we therefore define continual learning for table detection as follows. Suppose D_{1,2,...,t−1} is a sequence of datasets and M_{t−1} is a model that has been trained on them. When a new dataset D_t is introduced at time t, different scenarios are possible. Figure 2 displays four of the possible ways of consuming the new dataset. In the following, we describe them along with our proposed experiments.

Independent Training
This is the conventional method of training, in which a separate model is trained on each dataset. Algorithm 1 shows the straightforward batch-training procedure used here. The results of this experiment show the upper bound of what can be learned with the current data and architecture. Figure 2a presents this training process.

Joint Training
In joint training, all the available datasets are exploited. This setup acts as an upper bound on the learning capability of the model when using all the available data. As presented in Figure 2b, all the available samples are shuffled before batch training.

Fine-Tuning
The classical fine-tuning procedure is implemented for this experiment. As Figure 2c shows, during training on D_t, a model pre-trained on the previous datasets, M_{t−1}, is employed to initialize the parameters. Afterwards, the model is retrained with a lower learning rate on the new instances. Since this setup results in catastrophic forgetting, its performance serves as the lower bound for the learners.

Training with Experience Replay
The last experiment is the continual learning technique that we devised for our task: experience replay (Figure 2d). In this approach, R_{1,2,...,t−1} is a small memory dedicated to images from the prior datasets. These images are presented to the model while it is trained on the new data; to be precise, each batch contains samples from both D_t and R_{1,2,...,t−1}.
Algorithm 2 describes our batch training with experience replay. Assume that training is about to proceed on D_t and that we have the prior data and the trained model at our disposal. The algorithm begins by initializing the replay memory R_{1,2,...,t−1} as a random selection of images from D_1, D_2, ..., D_{t−1}. In each training iteration, a mini-batch is drawn from D_t and another from R_{1,2,...,t−1}; these are concatenated into one batch, on which one step of gradient descent is taken.
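The batch-mixing core of this procedure can be sketched as follows. This is a minimal illustration rather than the paper's exact implementation: the sample representation, the function names, and the default 3:1 new-to-replay split within each batch of four are assumptions.

```python
import random

def build_replay_memory(old_datasets, memory_size, seed=0):
    """Fill the replay memory R with a random selection of images
    drawn from all previously seen datasets D_1 ... D_{t-1}."""
    rng = random.Random(seed)
    pool = [sample for dataset in old_datasets for sample in dataset]
    return rng.sample(pool, memory_size)

def experience_replay_epoch(new_dataset, replay_memory, batch_size=4, replay_per_batch=1):
    """Yield mixed mini-batches: mostly samples from the new dataset D_t,
    topped up with a few samples replayed from memory."""
    rng = random.Random(0)
    new_per_batch = batch_size - replay_per_batch
    data = list(new_dataset)
    rng.shuffle(data)
    for i in range(0, len(data) - new_per_batch + 1, new_per_batch):
        batch = data[i:i + new_per_batch]
        batch += rng.sample(replay_memory, replay_per_batch)
        yield batch  # one gradient-descent step would be taken on this batch
```

Mixing at the batch level (rather than replaying old data in a separate phase) means every gradient step sees some old samples, which is what counteracts the drift toward the new dataset.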
The number of images in R_{1,2,...,t−1} is equal to one percent of the number of training samples in D_t. In this manner, we ensure that the memory is neither too small nor too large for preserving past knowledge while learning the new. Its images are selected randomly from the D_i proportionally to their size. If s_{D_i} denotes the number of training samples in dataset D_i, then the number of images from dataset D_i that reside in R_{1,2,...,t−1}, denoted C_{D_i}, is obtained from Equation (1):

C_{D_i} = 0.01 × s_{D_t} × ( s_{D_i} / Σ_{j=1}^{t−1} s_{D_j} )    (1)
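Assuming simple rounding (the paper does not state how fractional counts are handled), the per-dataset quota of Equation (1) can be computed as:

```python
def replay_quota(old_sizes, new_size, memory_fraction=0.01):
    """Split a replay memory of size memory_fraction * s_{D_t} across the
    old datasets proportionally to their training-set sizes (Equation (1))."""
    memory_size = round(memory_fraction * new_size)
    total_old = sum(old_sizes)
    return [round(memory_size * s / total_old) for s in old_sizes]
```

For example, with old datasets of 400 K, 300 K, and 300 K images and a new dataset of 500 K images, the memory holds 5000 images split as 2000/1500/1500.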

Networks
To validate the proposed approach, two state-of-the-art architectures were selected: Faster R-CNN [9], and the Pyramid Vision Transformer (PVT) [58,59] together with Sparse R-CNN [20]. Faster R-CNN is considered a classic baseline in many previous works; therefore, it was our first choice. Next, the Sparse R-CNN+PVT architecture was chosen, which is among the most recent state-of-the-art detectors.

Faster R-CNN+ResNet
Faster R-CNN is a two-stage detector proposed in 2015 [9]. In the first part of the architecture, a deep CNN extracts feature maps from the input image; we employed ResNet-50 [60] for this purpose. Contrary to Fast R-CNN [54], which uses a selective search algorithm to find object regions, Faster R-CNN utilizes a module called the region proposal network (RPN) to approximate the possible location of each object in the image. Using the RPN, Faster R-CNN effectively cuts prediction time and improves accuracy; moreover, the RPN makes Faster R-CNN end-to-end trainable. After the candidate regions of interest (RoIs) are detected, the feature map of each RoI is fed to a fully connected network with two branches at the final layer: a softmax layer that predicts the object class and a box-regression layer that computes the coordinates. The architecture of Faster R-CNN is shown in Figure 3.

Sparse R-CNN+PVT
Pyramid Vision Transformer made the first convolution-free object detector possible [58]. As proposed in [59], the combination of PVT with Sparse R-CNN creates a strong end-to-end method for object detection. In PVT, a progressive shrinking pyramid extracts multi-scale features. Similar to traditional CNNs, as the network grows in depth, the output resolution progressively shrinks. Moreover, an efficient attention layer was designed to further reduce the computation cost.
As Figure 4 shows, the structure of PVT consists of one main stage repeated four times to form the pyramid, which progressively reduces the size of the feature maps. Inspired by the transformer idea in language translation, the input image must be tokenized; each PVT stage therefore converts its input into patches that act as a dictionary of tokens. At the first stage, the input image of size H × W × 3 is split into (H × W)/4² patches; the patches are then flattened, and a linear projection is applied to obtain their embedded equivalents. Afterwards, the embedded patches and their positions are fed to a transformer encoder block, which includes a spatial-reduction attention (SRA) layer. Finally, by reshaping the output of the transformer encoder block, the feature map F_1 is obtained. Looking more closely at Figure 4, applying the same process to the output of each previous stage produces the feature maps F_2, F_3, and F_4. These feature maps are then fed into Sparse R-CNN [20] for object detection. Unlike the conventional RPN, which requires thousands of anchor boxes, Sparse R-CNN relies on a small set of learnable proposal boxes, whose predictions are refined in multiple stages. Each stage's output is passed to the next with its first two dimensions (rows and columns) halved, so the final feature map is 16 times smaller than the input, yet with greater depth.
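The first-stage tokenization can be illustrated with a small NumPy sketch. The patch size of 4 follows the description above; the function name and array layout are our assumptions:

```python
import numpy as np

def patchify(image, patch=4):
    """Split an (H, W, 3) image into flattened non-overlapping patch tokens,
    as in PVT's first stage (before the linear projection to the embed dim)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (H*W/p^2, p*p*C)
    tokens = image.reshape(h // patch, patch, w // patch, patch, c)
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return tokens
```

For an 8 × 8 × 3 input this yields (8 × 8)/4² = 4 tokens of length 48, each of which would then be projected linearly to the embedding dimension.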

Implementation Details
In every evaluation, both Faster R-CNN and Sparse R-CNN were trained using the publicly available MMDetection toolbox [61]. The training procedure and environment were the same for all experiments. The networks were trained on eight GPUs with four images per GPU. In the ER setting, the batch size is also four per GPU, of which one image comes from the replay memory.
ResNet-50 [60] is used as the backbone network for Faster R-CNN, and PVTv2-B2 [59] is chosen for Sparse R-CNN. The backbones were pre-trained on ImageNet [62]. Unless mentioned otherwise, all default configurations from the reference implementations are used.
In the IT scenario, the model is trained for three epochs. The initial learning rate is 10^-4, which decays by a factor of 10 every epoch. In FT and ER, the added datasets were fine-tuned for one epoch with a learning rate of 10^-5.
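The step schedule used in the IT setting reduces to a one-liner; this is an illustrative sketch, not MMDetection's actual scheduler implementation:

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.1):
    """Step schedule for independent training: start at base_lr and
    multiply by `decay` at every epoch boundary."""
    return base_lr * decay ** epoch
```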
To prevent overfitting in the ER scenario, data augmentation is applied to the images in the replay memory. Four types of augmentation from the Image corruptions library [63] are used, namely motion blur, JPEG compression, Gaussian noise, and brightness. The augmentation methods were chosen to simulate common real-life degradations.
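The sketch below shows how a randomly chosen corruption could be applied to each replayed image. It substitutes simplified NumPy stand-ins for two of the four corruptions (Gaussian noise and brightness); the function names and severity scaling are our assumptions, not the library's implementation.

```python
import random
import numpy as np

def gaussian_noise(img, severity=1):
    """Additive Gaussian noise (simplified stand-in for the library version)."""
    scale = 0.04 * severity
    noisy = img / 255.0 + np.random.normal(0.0, scale, img.shape)
    return np.clip(noisy, 0.0, 1.0) * 255.0

def brightness(img, severity=1):
    """Uniform brightness shift (simplified stand-in)."""
    shifted = img / 255.0 + 0.1 * severity
    return np.clip(shifted, 0.0, 1.0) * 255.0

def augment_replay(img, rng=random):
    """Apply one randomly chosen corruption to a replayed image."""
    fn = rng.choice([gaussian_noise, brightness])
    return fn(img)
```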

Datasets
In total, we utilized four modern, publicly available datasets: TableBank, PubLayNet, PubTables-1M, and FinTabNet. Table 1 summarizes the datasets' statistics.
• TableBank [18]: collected from the arXiv database [64], containing more than 417 K labeled document images. The dataset comes with two splits, Word and Latex; we combined both for training.
• PubLayNet [65]: another large-scale dataset, covering document layout analysis. Rather than being manually labeled, it was collected by automatically annotating the layout of PDF documents from the PubMed Central™ database. PubLayNet comprises 360 K document samples containing text, titles, lists, figures, and tables. Document samples from PubLayNet that do not contain tabular information were excluded from our experiments.
• PubTables-1M [66]: currently the largest and most complete dataset, addressing all three fundamental tasks of table analysis. For our experiments, we use its table detection annotations, which cover more than 500 K annotated document pages. Furthermore, we unify the various tabular-boundary annotations in this dataset into a single table class to enable joint training.
• FinTabNet [67]: employed to increase sample diversity. FinTabNet is derived from PubTabNet [19] and contains complex tables from financial reports. It comprises 70 K document samples with annotations of tabular boundaries and tabular structures.

Evaluations
In this section, the numerical results of the experiments are reported. As stated in Section 3, the four experiments are: independent training (IT), joint training (JT), fine-tuning (FT), and experience replay (ER). These experiments were performed for table detection with the two models described above, Faster R-CNN and Sparse R-CNN.
In IT, one model was trained on each individual train-set and tested against the corresponding test-set. In JT, the train-sets of all four datasets are shuffled, and the models are trained on the combined set of samples; the resulting network is tested separately on the four test-sets. In FT and ER, one network was trained on the four datasets in sequence. Unlike IT, the model is not tested until it finishes training on all four train-sets; the obtained network is then tested on the four test-sets in the same order. The FT approach is expected to forget the first dataset most severely (the effect of dataset order is studied in a later subsection). As mentioned, the proposed method, ER, takes the same path as FT, with the difference that it consumes a subset of the previous datasets while being fed the new samples; its four reported results are obtained in the same manner as FT's. ER is expected to suffer less from catastrophic forgetting and, ideally, to reach the performance of JT.

Evaluation Metrics
Being a sub-class of object detection, our task is evaluated with the same performance criteria as generic object detectors. The following are the definitions of the common evaluation metrics:

Precision
Precision is the number of true positives (TP) divided by the total number of positive predictions; it measures how accurate the model's detections are. Precision is calculated through Equation (2):

Precision = TP / (TP + FP)    (2)

Recall
To indicate the rate of missed positive predictions, a metric named recall (or sensitivity) is commonly used. Recall is obtained by dividing the correct positive predictions (TP) by all actual positives (TP + FN). Equation (3) shows the mathematical definition:

Recall = TP / (TP + FN)    (3)
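Both metrics reduce to one-line functions over the confusion-matrix counts:

```python
def precision(tp, fp):
    """Equation (2): fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (3): fraction of actual positives that are detected."""
    return tp / (tp + fn)
```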

Precision-Recall curve
For all possible confidence thresholds, the precision-recall (PR) curve plots precision versus recall. A good object detector maintains a high precision rate as its recall rate grows.

Intersection Over Union (IoU)
As written in Equation (4), the overlap between a predicted bounding box B_p and the ground-truth box B_gt is measured by the intersection over union of the two boxes:

IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)    (4)
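A minimal IoU implementation for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Equation (4): intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero when boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```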

Mean Average Precision (mAP)
The mean average precision (mAP) is a widely used metric for evaluating object detection models. The average precision (AP) is the area under the precision-recall curve for a class, and the mAP is computed by averaging the AP over all classes. Equation (5) formulates the metric:

mAP = (1/N) Σ_{r=1}^{N} AP_r    (5)

where AP_r is the average precision of class r and N is the number of classes.

Table 2 summarizes the results of these experiments. Comparing the FT and ER values, it can be observed that the proposed approach effectively prevents the models from forgetting previous datasets. To emphasize the contrast, the parenthesized values in the ER rows show the mAP gain of ER over FT. In particular, the mAP of the proposed method on TableBank is about 15 percent higher than FT's. We also see that Sparse R-CNN+PVT outperforms Faster R-CNN+ResNet in almost all experiments.
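The AP underlying Equation (5) is the area under the PR curve. A sketch using all-point interpolation follows; the choice of interpolation is our assumption, since the paper does not state which variant its evaluation toolbox uses:

```python
def average_precision(recalls, precisions):
    """Area under a discrete precision-recall curve with all-point
    interpolation. `recalls` must be sorted in ascending order."""
    # Make precision monotonically non-increasing from right to left.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    # Sum rectangle areas between consecutive recall points.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```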

Results
The precision-recall curves for ER and FT with the Sparse R-CNN+PVT architecture are presented in Figure 5. It is evident that as the IoU threshold rises, the FT curves plummet. This is further illustrated at IoU values of 0.9 and 0.95 (in green and red, respectively). Moreover, the figures again show that the older datasets are more prone to forgetting: compare TableBank's curves in Figure 5a,b with those in Figure 5g,h, which correspond to the most recent dataset. Figure 6 presents common failure cases of FT. In some cases, the model detects inaccurate bounding boxes, and in others we see frequent false positives. In contrast, the ER approach leads to better performance and prevents the model from forgetting. Figure 6. Qualitative results using the two methods: (a) fine-tuning, (b) experience replay. Blue represents true positives, and red denotes false positives. The samples are from the TableBank, PubLayNet, and PubTables-1M datasets, respectively. Experience replay maintains performance, while the fine-tuning approach suffers from false detections and inaccurate bounding boxes.

The Effect of Datasets Order
It is clear that the inherent differences between the datasets are the cause of catastrophic forgetting. Nonetheless, the results showed that the network's performance drop is harsher on the older samples. To investigate this, we repeated the experiments in Section 3 with a different sequence of datasets during the training phase. In the initial experiments, the order of the datasets was TableBank, PubLayNet, PubTables-1M, and FinTabNet; for the second trial, the sequence was changed to PubTables-1M, PubLayNet, TableBank, and FinTabNet. The rest of the settings are identical to the previous experiments.
The results of this trial are presented in Table 3. They support the previous ones, and the effect of forgetting is once again apparent. By comparing the results of the models on PubTables-1M in the first and second trials (Tables 2 and 3), we can infer that performance drops more drastically on datasets seen earlier in the sequence. As presented in Table 2, the effect of forgetting on the test-set of PubTables-1M is less than one percent, while in contrast, Table 3 shows a 1.1% improvement using ER over FT.

Comparison with State-of-the-Arts
Current state-of-the-art methods rely heavily on particular datasets for training and evaluation. In this study, however, we conducted the experiments on multiple datasets in sequence; to this end, some of the datasets were altered, and the training procedures differed from the customary ones. Hence, the results of this study are not directly comparable to the previous state of the art. Nevertheless, a few results are reported for the reader's convenience: Table 4 quotes the published values. Although our methods were trained in a continual setting, their results are close to those of conventional methods trained on a single dataset.

Conclusions and Future Work
Continual-learning-based methods have shown promising results in computer vision [16,47] and natural language processing [68]. In this paper, we present an incremental learning method for table detection in documents. To the best of our knowledge, this work is the first attempt to investigate the capabilities of a continual-learning-based method in the field of document image analysis. We conduct experiments in four different settings: independent training, joint training, fine-tuning, and training with an experience replay mechanism. While conventional independent training is considered the upper bound on performance, the fine-tuning results are taken as the lower bound. With the employed continual learning method (ER), we achieve a significant 15% improvement over the results obtained by fine-tuning. Furthermore, the performance of our method is comparable to previous state-of-the-art dataset-specific training methods.
Our evaluations on modern table detection datasets demonstrate the potential to address the major challenge of reliance on dataset-specific training in document image analysis. Moreover, this is an important step toward developing robust table detection methods for different domains. We hope that this study serves as an inspiration for future research in continual learning for document image analysis.