U-SSD: Improved SSD Based on U-Net Architecture for End-to-End Table Detection in Document Images

Abstract: Tables are an important element in a document and can express more information with fewer words. Due to the different arrangements of tables and text, as well as the variety of layouts, table detection is a challenge in the field of document analysis. Nowadays, as Optical Character Recognition (OCR) technology has gradually matured, it can help us obtain text information quickly, and the ability to accurately detect table structures can improve the efficiency of obtaining text content. The process of document digitization is influenced by the editor's style of table layout. In addition, many industries rely on large numbers of people to process data, which is expensive; thus, industry has adopted artificial intelligence and Robotic Process Automation to handle both simple and complicated routine text-digitization work. Therefore, this paper proposes an end-to-end table detection model, U-SSD, based on deep-learning object detection; it takes the Single Shot MultiBox Detector (SSD) as the basic model architecture, improves it with U-Net, and adds dilated convolution to enhance the feature-learning capability of the network. The experiment in this study conducts table detection on a dataset of accident claim documents provided by a Taiwanese law firm. The experimental results show that the proposed method is effective. In addition, the evaluation results on the open datasets TableBank, Github, and ICDAR13 show that SSD-based network architectures can achieve good performance.


Introduction
In recent years, driven by Industry 4.0, data-driven enterprises have become a global trend. The pandemic has changed the way companies work, and content on the internet and in paper documents is growing exponentially. The demand for document digitization by commercial and non-commercial institutions (such as banks, enterprises, educational institutions, or libraries) is gradually increasing, which can greatly improve the availability of data. However, extracting text data manually is a complicated, time-consuming, and impractical way to obtain reliable information from paper documents [1][2][3]. Office Automation is essential to the contemporary pursuit of a paperless work environment, and in most cases, it is completed by scanning and mail transmission. As the number of documents to be processed increases, the demand for automatic text information extraction will also grow rapidly [4], and using Optical Character Recognition (OCR) to automate the process of text information acquisition can reduce human work and greatly improve the overall speed [5]. Although OCR performs automatic retrieval of information from the text areas of a document, the current technology is suitable only for simple text data; thus, it easily produces recognition errors on complex information [6,7]. Data formats are mainly structured, semi-structured, and unstructured. Structured data are data processed for easy analysis; unstructured data are unprocessed data, such as common emails, images, and PDFs; semi-structured data fall between the two. In addition, the Internet of Things continues to improve and develop. Accurate edge information can minimize table detection errors, facilitate subsequent OCR operations, and improve the overall efficiency of the process.
The contributions of this study include an improved SSD and an end-to-end table detection model. VGG in the original SSD architecture was replaced with U-Net to increase edge features, and two additional convolution layers were added, for a total of 8, to increase feature extraction. In addition, the concept of dilated convolution was introduced to reduce information loss during feature transmission. Finally, six feature maps of different scales were selected to detect targets of different sizes. The proposed method greatly reduces the edge prediction errors of the table, and the detection effect is remarkable. The experiments used a dataset of accident claim documents provided by a Taiwanese law firm, and the performance of the proposed method was also evaluated on the TableBank, Github, and ICDAR13 datasets. The results show that network architectures based on SSD can achieve good performance.

Object Detection Methods and U-Net
Object detection can be divided into two types: one-stage and two-stage models. Earlier algorithms mostly used two-stage models, which separate object localization from classification and therefore have limitations in speed. In one-stage models, localization and classification are carried out simultaneously to improve on the speed of two-stage detection. Object detection algorithms based on Region Proposals were improved from Fast R-CNN [29]. In 2015, Ren et al. proposed Faster R-CNN [30], which can be divided into two parts: the Region Proposal Network (RPN) and the Fast R-CNN object detection network. The former is a fully convolutional network, and the latter uses the regions extracted by the RPN for object detection. The main process is to extract feature maps from input images through a basic convolutional network, generate candidate boxes through the RPN, transform the feature maps and candidate boxes to a uniform size by RoI Pooling, and finally, input the data to the classifier for classification and generation of object positions. Although the accuracy of Faster R-CNN in object detection is good, its computational cost is large. Therefore, Joseph Redmon et al. proposed You Only Look Once (YOLO) [31], a single-neural-network algorithm that can simultaneously detect multiple positions and categories to achieve end-to-end object detection. Darknet can be used as the basic architecture to achieve lightweight, highly efficient, and good detection results. The overall architecture of YOLO consists of 24 convolution layers and two fully connected layers; the activation function is Leaky ReLU, and the last layer uses a linear activation function. The input image size is 448 × 448 pixels, and input images are divided into grids to detect the existence of target objects in each grid and predict the bounding box, as well as the probability of its category.
As each grid generates multiple prediction boxes, Non-Maximum Suppression (NMS) is adopted to filter out redundant prediction boxes with a high repetition rate, and finally, the most appropriate prediction is selected. YOLOv2 [32] uses the new network structure Darknet-19; the size of the input image is a multiple of 32, and multi-scale training is adopted, changing the size of the output feature map to 13 × 13. A Batch Normalization (BN) layer is added after each convolutional layer to improve convergence. In addition, anchor boxes are introduced, and the bounding box is predicted as an offset from its anchor. Compared with YOLOv1, the overall speed is faster and the results are better. With Darknet-53 as the backbone, YOLOv3 [33] has a total of 53 convolutional layers; residual connections are added to avoid the vanishing-gradient problem caused by deepening the network, and the last-layer activation function is the logistic function. In addition, the Feature Pyramid Network (FPN) [34] is introduced to detect objects of different sizes using multi-scale feature maps, with three output feature maps of sizes 13 × 13, 26 × 26, and 52 × 52; thus, YOLOv3 has better accuracy and speed than the previous two versions. Liu et al. proposed SSD [27], a one-stage deep neural network for end-to-end object detection, in which the output space of bounding boxes is discretized into default boxes at different positions of feature maps of different scales and aspect ratios. In the prediction stage, the network outputs, for each default box, the probability of each object category, as well as the box adjustments that best match the object shape. The SSD network can detect objects of different sizes more effectively by combining multiple feature maps of different resolutions.
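The greedy NMS filtering described above can be sketched in plain Python. This is a minimal illustration, not any detector's actual implementation; boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates with one confidence score each:

```python
def iou(a, b):
    # Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring remaining box and
    # discard every other box that overlaps it too much
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two heavily overlapping detections of the same table collapse to the higher-scoring one, while a distant detection survives.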
Moreover, SSD does not require proposal generation or pixel/feature re-sampling stages, but encapsulates all computation within a single network; thus, SSD retains good accuracy even when the input image size is small. The main network structure of SSD is VGG16 [35], in which the two fully connected layers are replaced with convolution layers and four additional convolution layers are added. The feature maps output by five different convolution layers are each processed by two separate 3 × 3 convolution layers: one outputs the confidences for classification, with each default box generating N + 1 confidence values, and the other outputs the regression localization. Thus, SSD is highly accurate in real-time object detection and highly flexible in network architecture. Therefore, this paper chose this architecture as the core, replaced VGG16 with U-Net, and merged and retained edge information through the encoder-decoder features.
The image segmentation network based on the encoder-decoder structure was proposed by Ronneberger et al. [28] in 2015 and is mainly applied in medical imaging. As the network structure resembles the letter U, it is called U-Net. It is a classical fully convolutional network divided into two parts. The first part is the encoder, which performs feature extraction through down-sampling along the contracting path to obtain feature maps of different sizes, outputting a 32 × 32 feature map. The second part is the decoder. After up-sampling and feature extraction along the expansive path, the overlap-tile strategy is used to merge feature maps of corresponding sizes to overcome the problem of feature loss during feature transmission. This paper introduced this architecture to enhance the ability to retain features.
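To make the merging of corresponding sizes concrete, the following sketch tracks spatial sizes along both paths. It assumes a simplified, perfectly symmetric U-Net (every stage exactly halves or doubles the spatial size, ignoring the valid-convolution cropping of the original paper):

```python
def unet_skip_sizes(input_size=256, depth=4):
    # Encoder (contracting path): each stage halves the spatial size
    encoder = [input_size // (2 ** d) for d in range(depth)]
    bottleneck = input_size // (2 ** depth)
    # Decoder (expansive path): each upsampling doubles the size; the
    # skip connection concatenates the encoder map of identical size
    decoder = []
    size = bottleneck
    for skip in reversed(encoder):
        size *= 2
        assert size == skip  # merge is only valid when sizes match
        decoder.append(size)
    return encoder, bottleneck, decoder
```

Running `unet_skip_sizes(256, 4)` shows encoder maps of 256, 128, 64, 32, a 16 × 16 bottleneck, and decoder maps that climb back through the same sizes, which is why each decoder stage has an encoder partner to merge with.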

Dilated Convolution
In Fully Convolutional Networks [36], convolution is performed on the data before pooling, and when the reduced feature map is enlarged back to its original size, part of the feature information is lost; therefore, dilated convolution [37] was proposed for image segmentation. As dilated convolution enlarges the receptive field without the information loss of pooling, each convolutional output contains a wide range of image information [38,39]. Dilated convolution achieves good results when an image requires global information or when speech or text requires long sequences [40][41][42]. In order to collect more features, dilated convolution uses expansion rates of different sizes for convolution layers of different scales without increasing the number of parameters [43]. In addition, dilated convolution is effective in image recognition, object detection, and image segmentation [44][45][46]. Therefore, this study added dilated convolution to compensate for slightly lost feature information and maximize the retention of table features.

Table Detection
Kim and Huang [47] proposed a rule-based detection method applied to text and web pages, which was divided into two stages: first, features were extracted from the table, and then, grid pattern recognition was carried out with tree structures. Kasar et al. [48] proposed using an SVM classifier to divide an image into table and non-table areas, and then detected crossing horizontal lines, vertical lines, and low-level features to extract and recognize table information. Deep learning performs well on images, giving table detection better robustness and room for development. In recent years, Gilani et al. [23] proposed using Faster R-CNN to detect tables. Schreiber et al. [24] proposed DeepDeSRT, which conducts table detection and structure identification based on Faster R-CNN to identify the positions of rows, columns, and cells in a table. Yang et al. [49] proposed treating documents as pixels and extracting semantic structures using multimodal fully convolutional neural networks. Paliwal et al. [50] used an encoder-decoder combined with VGG as the infrastructure to detect tables and columns, with the two tasks sharing the encoder. The decoders are trained as two separate parts: first, the model produces the table detection result, and then columns are obtained from the table area through a semantic method. Li et al. [18] proposed a feature generator based on a Generative Adversarial Network (GAN) for table detection to improve performance with fewer rules. Huang et al. [51] proposed YOLOv3 as the basic architecture for table detection, with adaptive adjustments and optimizations according to the specific characteristics of tables. Riba et al. [52] used an image-based method to detect tables in documents and extract repetitive structural information by focusing on structure without relying on words. Prasad et al. [15] proposed CascadeTabNet, which uses Cascade R-CNN combined with transfer learning and data augmentation to improve the process and detect tables and their structures.

Proposed Method
The model architecture proposed in this paper is based on SSD, with the original VGG backbone improved by U-Net; it is called U-SSD, as shown in Figure 1. In the data input layer, the size of 300 × 300 was maintained, and VGG16 was replaced by U-Net in the basic network layer. As the VGG architecture is highly similar to the feature-extraction part of the contracting path (encoder part) of U-Net, feature extraction was carried out by the convolution layers. In the process of downsampling, some image edge information is lost during the convolution operations. The other part of the U-Net architecture is the expansive path (decoder part). In the decoder stage, the overlap-tile strategy was used to mirror the feature map, and the feature map generated by upsampling was combined with the feature map of the same size generated in the encoder stage. In addition, in order to greatly reduce feature loss during feature transmission and complete the lost image edge information, multi-scale feature fusion was used to concatenate feature dimensions. Therefore, replacing VGG16 with U-Net improved the effect of feature map extraction. In the feature layers of the latter half of the original SSD, Conv4_3 and the output of the last layer of VGG16 are used as the starting points of the multi-scale feature maps. In the latter part of the feature layers of U-SSD, because part of the image information is lost when the feature map is scaled down and enlarged again in the U-Net encoder stage, this study added two additional convolutional layers, for a total of 8 (originally 6 in SSD), as shown in Figure 2. In addition, as in the original SSD, this study kept the last six layers of feature maps at different scales (38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3, 1 × 1) for the detection of targets of different scales and fields of view, and an L2 normalization layer was placed after Conv13.
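With the six feature-map scales listed above, the total number of default boxes can be counted directly. The per-location box counts below (4 or 6) are those of the standard SSD300 configuration from the original SSD paper; whether U-SSD keeps exactly these counts is an assumption:

```python
def total_default_boxes(feature_maps, boxes_per_location):
    # Total default (prior) boxes = sum over feature maps of
    # grid_height * grid_width * boxes anchored at each location
    return sum(s * s * k for s, k in zip(feature_maps, boxes_per_location))

# Six feature-map scales kept by U-SSD (same as SSD300)
scales = [38, 19, 10, 5, 3, 1]
# Default boxes per location in the original SSD300 configuration
per_location = [4, 6, 6, 6, 4, 4]
```

Under these assumptions the detector evaluates 8732 default boxes per image, which is why NMS over the predictions is essential.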
The loss function and ground-truth matching optimization are based on SSD. The loss function L is a weighted sum of the localization loss (loc) and the confidence loss (conf), as shown in Equation (1):

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))    (1)

where x_ij^p ∈ {1, 0} is an indicator for matching the i-th default box to the j-th ground-truth box of category p, and N is the number of matched default boxes. The localization loss is a smooth L1 loss L_loc between the predicted box (l) and the ground-truth box (g) parameters, and the weight term α is set by cross-validation. The confidence loss L_conf is the softmax loss over multiple class confidences (c). Smooth L1 loss avoids the gradient explosion that a large offset would cause when the ground-truth and predicted boxes differ greatly. L_loc is defined in Equation (2); we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h). The confidence loss is defined in Equation (3).
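The smooth L1 behavior described above (quadratic near zero, linear for large offsets) can be illustrated with a minimal sketch; the helper names are hypothetical, and the offsets are assumed to be already encoded relative to the default box:

```python
def smooth_l1(x):
    # Smooth L1: 0.5 * x^2 when |x| < 1 (stable gradients near zero),
    # |x| - 0.5 otherwise (bounded gradient for large offsets)
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def localization_loss(pred_offsets, gt_offsets):
    # Sum of smooth L1 over the four encoded offsets (cx, cy, w, h)
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, gt_offsets))
```

Note how an offset error of 0.5 contributes only 0.125 to the loss, while an error of 2 contributes 1.5 rather than the 2.0 that plain L2 regression would square into 4.0.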

Dilated Convolution
This paper added dilated convolution to Conv13 and Conv14, as shown in Figure 2. Dilated convolution increases the receptive field of the convolution without reducing the spatial feature resolution or increasing the model and parameter complexity, meaning it compensates for the feature information lost in the first half and adds more global information to the feature map. Figure 3 shows the difference between ordinary convolution and dilated convolution. The blue dots represent the 3 × 3 convolution kernel, and the yellow area represents the receptive field after the convolution operation. Figure 3a shows the general convolution process, in which the dilation rate is 1 and the kernel slides densely over the feature map. Figure 3b shows a dilation rate of 2, for which the receptive field is 5 × 5. Figure 3c shows a dilation rate of 3, for which the receptive field is 7 × 7. Figure 3 shows that the benefit of dilated convolution is that, without the information loss of pooling, the receptive field is enlarged and each convolutional output contains a wide range of information. In this paper, the dilation rates added to Conv13 and Conv14 were 6 and 5, respectively.
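The receptive-field figures above follow from the standard formula for the effective kernel extent of a k × k kernel with dilation rate r, namely k + (k − 1)(r − 1); a one-line sketch:

```python
def dilated_kernel_size(k, rate):
    # Effective extent of a k x k kernel with dilation `rate`:
    # (k - 1) gaps between taps, each stretched to `rate` pixels
    return k + (k - 1) * (rate - 1)
```

For a 3 × 3 kernel this gives 5 at rate 2 and 7 at rate 3, matching Figure 3; by the same formula, the rates 6 and 5 used on Conv13 and Conv14 span 13 × 13 and 11 × 11 regions, respectively.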

Dataset and Data Pre-Processing
The training and testing datasets used in this study were provided by a Taiwanese law firm, with a total of 1378 tabular data, including diagnostic certificates, accident site maps, and preliminary accident analysis tables. The data are image files, mostly composed of large object tables and a few text areas, and include both scanned and photographed images shot from different angles and under different lighting. The data were split into training, validation, and test sets at a ratio of 8:1:1. The Github open dataset is composed of text areas and charts, and most of its tables have no outer frame. The ICDAR13 dataset, provided by the International Conference on Document Analysis and Recognition for table detection and structure recognition, is composed of PDFs, which must be converted into images for use in the research model. Its data, from the EU and US governments, consist of text areas and charts, containing 150 tables and a total of 238 charts.
The quality of the data determines the quality of model training; thus, in order to make a model more accurate, data are often transformed and pre-processed before training, commonly by resizing, rotation, and color conversion. As the dataset of this study was symmetrical and neat, the influence of these common methods on this study was limited. Following Prasad, D. et al. [15], this paper adopted a data transformation for document images, which contain text, tables, and blank areas: the model can better understand the data when the text areas are thickened and the blank areas are reduced. This paper adopted image dilation for the transformation, dilating the black parts of the original image. Before dilation, the original image was converted into a grayscale image and binarized, and then the dilated image was generated, as shown in Figure 4.
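The thickening step can be illustrated on a toy binary grid. This is a pure-Python stand-in for the morphological dilation an image library would provide (`dilate_binary` is a hypothetical helper, not the pipeline's actual code); binarization is assumed to have already mapped ink to 1 and background to 0:

```python
def dilate_binary(grid, iterations=1):
    """Morphological dilation of a binary grid (1 = ink) with a 3x3
    structuring element, mimicking how text strokes are thickened."""
    h, w = len(grid), len(grid[0])
    for _ in range(iterations):
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # A pixel turns on if any pixel in its 3x3 neighborhood is on
                out[y][x] = int(any(
                    grid[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))))
        grid = out
    return grid
```

A single ink pixel grows into a 3 × 3 blob after one pass, which is exactly the stroke-thickening effect that helps the detector separate text regions from blank space.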

Evaluation Method and Parameter Settings
Intersection over Union (IoU) is one of the basic measurement standards for object detection, as it calculates the overlap between the predicted target and the ground-truth target. The larger the overlap area, the higher the IoU value, indicating a more accurate prediction, as shown in Figure 5. As it is quite a strict index, the results will be relatively poor if there is even a slight deviation; in general, a prediction is considered good if the IoU is greater than 0.5. A True Positive (TP) is a positive sample predicted as positive (a correct prediction); a False Positive (FP) is a negative sample predicted as positive (a wrong prediction); a False Negative (FN) is a positive sample predicted as negative (a wrong prediction); and a True Negative (TN) is a negative sample predicted as negative (a correct prediction). Precision and recall are used as the evaluation indices in this paper.
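The IoU measure described above reduces to a few lines; a minimal sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    # IoU = intersection area / union area of two axis-aligned boxes
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

The strictness mentioned above is visible here: a predicted table shifted by half its width against an equal-sized ground truth already drops the IoU to 1/3, well below the usual 0.5 threshold.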
There is a tradeoff between precision and recall: usually, when precision is high, recall is low, and vice versa. When the two indices conflict, the F1-score is the most commonly used measurement index, as shown in Equation (4), as it considers both precision and recall, which is especially useful in the case of class imbalance.
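The combination of precision, recall, and F1-score used above can be sketched directly from the TP/FP/FN counts (a hypothetical helper, not the paper's evaluation code):

```python
def precision_recall_f1(tp, fp, fn):
    # precision = TP / (TP + FP); recall = TP / (TP + FN);
    # F1 = harmonic mean of precision and recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because F1 is a harmonic mean, it is dragged down by whichever of precision or recall is lower, which is exactly why it is preferred when the two disagree.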
The Precision-Recall (PR) Curve takes precision as the Y-axis and recall as the X-axis, and each point represents the precision and recall under a different threshold value. Average Precision (AP), the average precision of a single category, is the area under the PR curve.
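The area under the PR curve can be approximated with a simple rectangle sum over the sampled (precision, recall) points. This is one common approximation, not necessarily the exact interpolation used by any particular evaluation toolkit:

```python
def average_precision(precisions, recalls):
    """Approximate AP as the area under the PR curve: sum of
    precision * (increase in recall) over points sorted by recall."""
    pts = sorted(zip(recalls, precisions))
    ap, prev_r = 0.0, 0.0
    for r, p in pts:
        ap += p * (r - prev_r)  # rectangle of height p over the recall step
        prev_r = r
    return ap
```

A perfect detector (precision 1.0 at every recall level) scores AP = 1.0; losing precision at higher recall shrinks the area accordingly.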
The experimental environment settings are shown in Table 1. This paper scaled the input images to 300 × 300 pixels, the same as SSD, and used the Adam optimizer. The model was trained for 100 epochs: for the first 50 epochs, the learning rate was 0.0005 with a batch size of 8; for the last 50 epochs, the learning rate was 0.0001 with a batch size of 4.

Evaluation
This paper tested three open datasets (TableBank, the Github open dataset, and ICDAR2013) as well as the dataset provided by a Taiwanese law firm, and compared the results with those of the Faster R-CNN, YOLOv3, and SSD detection models. Faster R-CNN was evaluated in two variants, with VGG and ResNet50 as the backbone. In addition, a combination of YOLOv3 and U-Net was compared with the U-SSD model proposed in this paper.

Evaluation on Our Dataset
The dataset provided by a Taiwanese law firm was evaluated, and the results of the models at IoU = 0.5 are shown in Table 2. In terms of precision, the one-stage object detection models scored higher than the two-stage models, although the difference was not large. Conversely, the recall of the two-stage models was relatively high, but the values were also close. In terms of the comprehensive performance of the F1-score and AP at IoU = 0.5, all six models performed well, among which U-SSD received good scores. The Precision-Recall Curve is shown in Figure 6.
Since tables differ greatly from natural objects, the higher the overlap rate between the predicted and real positions, the better; therefore, the results at IoU = 0.8 are particularly important, as shown in Table 3. The one-stage models performed better than the two-stage models in precision, recall, F1-score, and AP, and the U-SSD model proposed in this paper performed best in the comprehensive comparison. The Precision-Recall Curve is shown in Figure 7.
Generally, as the IoU threshold increases, detection performance decreases; however, the smaller the position offset of the predicted table, the higher the accuracy, which is especially important for table detection, as shown in Table 4. The average precision of U-SSD exceeds 90% from IoU = 0.5 to 0.9. Figure 8 shows that when the IoU threshold was set below 0.7, the performance of all models was comparable. As the threshold was increased, all models except U-SSD declined noticeably, while U-SSD remained relatively stable and accurate, which shows that its performance is more robust than that of the other models.

Evaluation on Open Dataset
The layout of documents varies with the editor's style. The layouts of public datasets are mostly composed of text areas, tables, or pictures, among which the styles of ICDAR2013 are the most diverse, while the data of this study mostly consist of large object tables and a few text areas, similar to TableBank and Github. Therefore, the detection results on ICDAR13 were slightly poorer than those on the other three datasets, as shown in Table 5. Although the validation results of the proposed model on the public datasets show no significant advantage at an IoU threshold of 0.7, the differences between the models were small. When the IoU was 0.9, the F1-score of each model decreased significantly, confirming that detection performance drops as the IoU threshold increases. Thus, it is very important to predict the table position with as little offset as possible and as accurately as possible. Our model scored higher than the other five models in almost all cases at IoU = 0.9, and its result at the average IoU was quite close to the optimum. As can be seen in Figures 9-12, U-SSD (brown line) declines less than the other five models; this result illustrates that the stability and detection performance of U-SSD are very good, and that SSD-based network architectures perform well.

Ablation Experiment
Dilated convolution increases feature information without losing resolution, and U-Net is highly influential in the field of image segmentation, as it extracts feature maps through the encoder and combines them with the corresponding maps in the decoder to reduce feature loss. This paper evaluated whether adding dilated convolution and the U-Net improvement under the SSD architecture have an effective impact on the model. The evaluation was based on the dataset provided by a Taiwanese law firm and used IoU thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9; the results show that detection capability decreased as the IoU threshold increased. Table 6 shows that adding dilated convolution to SSD improved the result, while replacing the main SSD backbone VGG with U-Net alone caused a slight drop. U-Net loses some edge features when the reduced feature maps are enlarged again, which has a great influence on table detection. Therefore, when dilated convolution was added and the convolutional receptive field was enlarged, the feature loss was compensated. The experimental results show that when VGG in SSD was replaced by U-Net and dilated convolution was added, the F1-score under the average IoU reached 0.94. When the threshold was increased, the detection ability did not decrease much, but remained at a certain level; therefore, the robustness and performance of the proposed model are good.

Performance of Inference Time
The two-stage models were slower in detection than the one-stage models. Table 7 shows the detection times of the different models. The detection time of the models based on Faster R-CNN was longer than that of the models based on YOLOv3 and SSD. Although adding U-Net to the original models increased the detection times of YOLOv3 + U-Net and U-SSD, they were still faster than Faster R-CNN.

Conclusions
This paper proposed an end-to-end network model for table detection, U-SSD, which improves SSD with U-Net, a classical image segmentation model. Edge information was added by combining feature maps, and dilated convolution was added to increase the receptive field of feature extraction and supplement the lost features. In addition, the dataset provided by a Taiwanese law firm was used as the training sample; thus, the object detection and image segmentation models were trained on real-world data. The experimental results show that the improved U-SSD can further improve the accuracy of table detection and minimize prediction error at table edges, and the verification results on public datasets show that the detection effect is good. Therefore, using an image segmentation model within object detection can also achieve good results, and adding dilated convolution can effectively enrich feature information.
This paper focused on tables as large objects, and the layouts were mostly tables with a few text areas; therefore, the detection of small object tables, cells, text areas, and charts in documents was limited. In the future, we will extend the optimized model to include text areas, different styles of cells, and pictures, in order to detect small object cells and further identify table structures in documents.