DWSA: An Intelligent Document Structural Analysis Model for Information Extraction and Data Mining

Abstract: The structure of a document contains rich information such as logical relations in context, hierarchy, affiliation, dependence, and applicability. It will greatly affect the accuracy of document information processing, particularly of legal documents and business contracts. Therefore, intelligent document structural analysis is important to information extraction and data mining. However, unlike the well-studied field of text semantic analysis, current work in document structural analysis is still scarce. In this paper, we propose an intelligent document structural analysis framework through data pre-processing, feature engineering, and structural classification with a dynamic sample weighting algorithm. As a typical application, we collect more than 11,000 insurance document content samples and carry out machine learning experiments to check the efficiency of our framework. Meanwhile, to address the sample imbalance problem in the hierarchy classification task, a dynamic sample weighting algorithm is incorporated into our Dynamic Weighting Structural Analysis (DWSA) framework, in which the weights of different category tags according to the structural levels are iterated dynamically in training. Our results show that the DWSA has significantly improved the comprehensive accuracy and the classification F1-score of each category. The comprehensive accuracy is as high as 94.68% (3.36% absolute improvement) and the Macro F1-score is 88.29% (5.1% absolute improvement).


Introduction
With the rapid growth of insurance policies, consumers and insurance industry practitioners need to effectively extract useful information from a vast number of insurance documents. For time- and labor-intensive activities such as identifying the key terms and differences between similar policies, intelligent document processing methods are highly desirable. Various document processing techniques have been developed, allowing electronic documents to be processed more efficiently [1][2][3][4]. In recent years, machine learning methods have been widely used in data mining and information extraction and have achieved great success. However, in the insurance industry specifically, Chinese insurance documents pose extra challenges because most exist in unstructured form and the text features are often mixed with noise, which causes feature recognition to fail when the documents are processed. Therefore, further work is needed to automatically process unstructured insurance documents, extract key information, and transform it into structured data without losing important features.
This paper proposes an intelligent information extraction framework for Chinese insurance documents. The proposed model can provide a convenient technology platform for insurance practitioners or related researchers. In addition, a dynamic sample weight algorithm is proposed to address the sample imbalance problem.

Related Studies and the Current Contribution
Machine learning methods are widely used in data mining and document analysis tasks [5][6][7][8][9][10]. Automatic classification systems have been applied to mail classification, electronic conferencing, information filtering, and so on [11][12][13][14][15]. The Xgboost algorithm [16] implements the GBDT algorithm; it has been widely used in competitions and many other machine learning projects and has achieved satisfying results [17,18]. The Lightgbm algorithm [19] supports efficient parallel training, processes data quickly, and also performs well. Gómez et al. [20] proposed a model that, given an initial UML model, semi-automatically generates a number of alternatives suitable for data structures. Liu et al. [21] provided a comprehensive survey of the progress made in the field of document image classification over the past two decades.
In terms of sample imbalance, many researchers use the methods of over-sampling and under-sampling to solve the problem of sample imbalance [22]. Lin et al. [23] proposed a new loss function: Focal loss. The loss function is modified based on the standard cross entropy loss. This function can make the model focus more on difficult samples in training by reducing the weight of easily classified samples.
To address the serious challenge of information explosion, automated tools are needed to help people quickly extract specific factual information from a large number of information sources [24][25][26][27][28]. Much work has been done in the fields of entity recognition and extraction, knowledge acquisition [29][30][31][32], and text analysis and reasoning [33][34][35][36]. Quan et al. [37] proposed a paper classification and information extraction system for the computer science field. The team used the Naive Bayes algorithm to automatically classify a large number of papers and extract relevant information. On the algorithm side, a new Weighted Naive Bayes model was developed to better fit the data model. The data set contains only paper abstracts; the documents are not in PDF format and are relatively easy to process. Still, the accuracy of the information extraction system cannot reach the application standard.
The Chinese language has many characteristics that differ from English, so techniques developed for English do not transfer directly [38]. For example, there is no natural boundary between words in Chinese, so the text must be divided into words before classification. In addition, the proportion of syntactic analysis and semantic analysis differs between languages [39,40]. Therefore, Chinese language processing needs additional data pre-processing work. At present, the main difficulty in Chinese document processing is that the accuracy or F1-score of intelligent recognition and classification is not high enough to achieve the target performance. As a result, the accuracy and efficiency of information extraction cannot reach the application standard.
The whole task of Chinese insurance document analysis can be divided into the following parts: (1) Data pre-processing and data cleaning are conducted at first. Most insurance documents exist in PDF format, and all document contents need to be recognized and converted. (2) An algorithm model needs to be built for structural classification and information extraction. (3) Structured documents are the output, including the retypesetting of the original text, tables, and other information in the document. Accordingly, we design the Dynamic Weighting Structural Analysis (DWSA) framework. Compared with related works, the proposed method is first applied in the Chinese insurance document field, and the dynamic sample weight algorithm can improve the performance of the structural classification and address the problem of sample imbalance effectively.
The main contributions of this paper are summarized as follows.
• This paper proposes an intelligent information extraction framework for insurance documents. We use the techniques of data pre-processing, feature engineering, and structural classification to process the documents. The proposed model can provide a convenient technology platform for insurance practitioners or related researchers.
• A dynamic sample weight algorithm is proposed to address the sample imbalance problem and improve the structural classification part of the DWSA framework. The comprehensive accuracy reaches 94.68% (3.36% absolute improvement) and the Macro F1-score reaches 88.29% (5.1% absolute improvement).

DWSA Framework
In this work, first an intelligent structural analysis (SA) framework is proposed to help people to survey a large amount of Chinese insurance documents. Secondly, a dynamic weighting (DW) classification algorithm based on Adaboost is proposed to address the sample imbalance problem and improve the classification part in the structural analysis framework. Through the structural analysis, the whole document contents are divided into several hierarchical categories by the machine learning algorithm to facilitate subsequent structured output containing key information in the document. The framework essentially includes three parts: data pre-processing, feature engineering, and structural classification with a dynamic sample weighting algorithm.
In the data pre-processing part, we modify the Pdfplumber framework for PDF document information recognition and conversion and use JIEBA framework to carry out word segmentation. We convert the content into sentence-level text and text location information such as distance from page top and left. Then we remove the punctuations, the logos, and stop words in the text. The stop words are similar to 'the', 'is', 'at', 'which', 'on', etc. The stop words contain no insurance information and act as a sort of noise for structural classification. We generate a stop words list to identify and remove the stop words.
In the feature engineering part, we manually label the sentences according to the structure of the insurance document, such as level-1 heading, level-2 heading, content, etc. We also do context feature fusion which mainly incorporates features of the previous sentence into the current sentence features. See the detail of the context feature fusion in Section 2.2.
In the structural classification part, unlike the well-studied field of text semantic analysis and text classification, we carry out structural classification based on the sample features obtained in feature engineering. The sentences are classified into five structural categories: useless(-1), content(0), level-1 heading(1), level-2 heading(2), and level-3 heading(3). The data set details are shown in the Table 1. Through the classification, the content of the documents is segmented and divided into a lot of smaller parts. We store the content under the corresponding heading and save it as the dictionary data type. The key is the heading and the value is the corresponding content. With the key-value structure of the dictionary data type, we can extract the core information and output structural documents. (See output details in the Section 2.3).
In this structural classification part, the sample amount of each category is often quite different. The sample imbalance problem will thus occur. Even if the classification model is obtained, it is easy to over-rely on the limited data and lead to overfitting problems. When the model is applied to new data, the accuracy will not be high enough. To address this sample imbalance problem, dynamic weighting coefficients of each category are incorporated into the DWSA to conduct the classification task. We use accuracy and F1-score [41] to evaluate the model. The dynamic sample weighting algorithm adjusts the weight coefficients of each category. The weight coefficients of different categories are updated iteratively according to the classification F1-score and the feature importance. When the F1-score of the category with few samples is low, the algorithm will automatically assign the larger weight. Also, when the F1-score of other categories is affected and decreases, the algorithm will optimize the weight allocation to achieve the globally optimal result.

Approach
As shown in Figure 1, the input is a large number of unstructured Chinese insurance documents. The documents are in PDF format; an example is shown in the appendix file Figure A1. The document content includes product name, title, insured liability, insured age, etc. Through the intelligent structural analysis, structured output of information is obtained. The specific steps are described in the following subsections.
Figure 1. The framework of Dynamic Weighting Structural Analysis (DWSA). The DWSA framework mainly includes three parts: data pre-processing, feature engineering, and structural classification.

Data Pre-Processing
The existing insurance documents are mostly saved in PDF format, and document content information needs to be converted before structural analysis operation. In this paper, we modify the Pdfplumber algorithm framework for PDF document contents conversion and convert the document content into sentence-level text. In terms of the treatment of broken sentences, a large number of sentences become incomplete sentences in the recognition and conversion step. The modified Pdfplumber framework can fit in this situation. For the insurance field, we do sentence segmentation based on semantic information. The broken sentences need to be treated differently in different circumstances.
Then, we perform word segmentation and remove punctuation and stop words from the text. There is no natural boundary between words in Chinese, so the text must be divided into words before classification. JIEBA uses a Chinese thesaurus to determine the probability of association between Chinese characters and segments the text accordingly. When we use JIEBA for Chinese word segmentation, all the words in the sentence are separated. The stop words contain no insurance information and constitute a sort of noise for structural classification. We therefore build a stop-word list and use it to identify and remove the stop words programmatically.
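The cleaning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the punctuation set, and the function name `clean_tokens` are hypothetical, and the token list is hand-written to stand in for the output of a segmenter such as JIEBA (for example, its `jieba.lcut` function).

```python
import string

# Hypothetical stop-word list; the paper builds its own list for insurance text.
STOP_WORDS = {"的", "了", "和", "是", "在"}

# ASCII punctuation plus a few common full-width Chinese marks (illustrative only).
PUNCTUATION = set(string.punctuation) | set("，。、；：？！（）《》“”‘’")


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens, stop words, and punctuation-only tokens."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue
        if token in stop_words:
            continue
        if all(ch in PUNCTUATION for ch in token):
            continue
        cleaned.append(token)
    return cleaned


# Tokens as a word segmenter might return them (hand-written for this example).
tokens = ["本", "合同", "的", "保险", "责任", "是", "：", "身故", "保险金", "。"]
cleaned = clean_tokens(tokens)
print(cleaned)
```

Keeping the stop-word list as a parameter makes it easy to swap in a domain-specific list without touching the filtering logic.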
The modified Pdfplumber framework is also used to extract document content features. The features include information such as font size, the font, coordinate position, and context spacing, etc. The extracted document feature information is further processed in depth in the data cleaning and feature engineering.
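As a rough sketch of how sentence-level layout features could be derived, the function below aggregates per-character records into one feature record. It assumes character records shaped like the dictionaries that pdfplumber exposes through `page.chars` (keys `text`, `fontname`, `size`, `x0`, `x1`, `top`, `bottom`); the function name, the output field names, and the sample values are illustrative assumptions, and the authors' modified framework may compute these features differently.

```python
def sentence_features(chars):
    """Aggregate the character records of one text line into layout features."""
    if not chars:
        raise ValueError("chars must not be empty")
    chars = sorted(chars, key=lambda c: c["x0"])  # left-to-right reading order
    left = min(c["x0"] for c in chars)
    right = max(c["x1"] for c in chars)
    top = min(c["top"] for c in chars)
    bottom = max(c["bottom"] for c in chars)
    # Use the most frequent font and size so a stray symbol does not dominate.
    fonts = [c["fontname"] for c in chars]
    sizes = [c["size"] for c in chars]
    return {
        "content": "".join(c["text"] for c in chars),
        "font-family": max(set(fonts), key=fonts.count),
        "size": max(set(sizes), key=sizes.count),
        "top": top,             # distance from page top
        "left": left,           # distance from page left
        "width": right - left,  # total width of the sentence
        "height": bottom - top, # total height of the sentence
    }


# Hand-written character records standing in for extracted PDF characters.
chars = [
    {"text": "险", "fontname": "SimHei", "size": 16.0,
     "x0": 88.0, "x1": 104.0, "top": 100.0, "bottom": 116.0},
    {"text": "保", "fontname": "SimHei", "size": 16.0,
     "x0": 72.0, "x1": 88.0, "top": 100.0, "bottom": 116.0},
    {"text": "责", "fontname": "SimHei", "size": 16.0,
     "x0": 104.0, "x1": 120.0, "top": 100.0, "bottom": 116.0},
]
features = sentence_features(chars)
print(features)
```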

Feature Engineering
After data pre-processing, document content containing various feature information is obtained. Also, the Chinese word segmentation is carried out using the JIEBA framework. We label the key information such as headings at all levels and core attributes. The details of the labeled data set are shown in Table 1. After the conversion, the document data format is shown in Figure 2. The original content is in Chinese and we translate it into English as an alternative for better understanding. See the appendix file Figure A2 for the original Chinese content.
In addition to sentence-level content, we also get other feature information. These features are important for structural classification. For example, as shown in Figure 2, the document content belonging to the level-1 heading can be significantly different from contents of other levels in terms of font and font size features. Therefore, we can distinguish the level-1 headings by font size and other features.

Table 1. Details of the data set. The data set includes five categories and has 11,300 samples.

| Class | Content | Number |
|---|---|---|
| Useless (-1) | Useless content in the document such as footer, special symbols, catalogue, etc. | 2469 |
| Content (0) | The main and useful content of the document | 6704 |
| Level-1 Heading (1) | The level-1 heading in the document | 469 |
| Level-2 Heading (2) | The level-2 heading in the document | 1575 |
| Level-3 Heading (3) | The level-3 heading in the document | 83 |

Figure 2. The file format in the data pre-processing step. We modify the Pdfplumber algorithm framework for PDF document information conversion, which can get more features in document content. The features mainly include size, font family, top, company, etc. The features named top, left, width, and height mean distance from page top, distance from page left, total width, and the total height of the sentence, respectively.
Also, context feature information can improve the performance of the structural classification. We correlate the feature information of individual words and sentences with multi-sentence and multi-word information. As shown in Figure 3, $X_2$ is the feature of the current sentence before the context feature fusion. $X_2$ is fused with the feature of the previous sentence, $X_1$, to generate the final feature vector. Based on the existing features such as left, top, and height, we obtain the context features named preleft, pretop, etc., which contain the content and location information of the previous sentence. The features of the previous sentence provide additional useful information to the training sample. We vectorize the feature information and concatenate the context features with the current sentence features. Each sentence has many features, and each feature counts as one dimension after vectorization. The context features therefore increase the dimension of the feature information, supplementing and enhancing the context so that the machine learning algorithm can learn and fit the data better.
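The fusion of each sentence with its predecessor can be sketched as below. This is an illustrative sketch under assumptions: the function name `fuse_context`, the choice of fused keys, and the use of `None` for the first sentence (which has no predecessor) are not taken from the paper; only the `pre` naming pattern (preleft, pretop) follows the text.

```python
def fuse_context(sentences, keys=("left", "top", "size", "font")):
    """Append the previous sentence's features, prefixed with 'pre', to each sentence."""
    fused = []
    previous = None
    for current in sentences:
        row = dict(current)  # copy so the input records stay unchanged
        for key in keys:
            # The first sentence has no predecessor; None marks the missing context.
            row["pre" + key] = previous[key] if previous is not None else None
        fused.append(row)
        previous = current
    return fused


# Hypothetical feature records for two consecutive sentences.
sentences = [
    {"left": 72.0, "top": 100.0, "size": 16.0, "font": "SimHei"},
    {"left": 90.0, "top": 130.0, "size": 10.5, "font": "SimSun"},
]
fused = fuse_context(sentences)
print(fused[1])
```

Each fused record has twice as many fields as the original one, which is the dimensionality increase described above.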

Dynamic Weighting Algorithm
Unlike the well-studied field of text semantic analysis and text classification, in this part we need to perform sentence-level hierarchy classification of document content, based on the sample features obtained in feature engineering. In general, an insurance document has three levels of headings. We also need to separate useless content from useful content. Therefore, the sentences with features in documents are classified into five categories: useless(-1), content(0), level-1 heading(1), level-2 heading(2), and level-3 heading(3). Besides semantic information features, in feature engineering we also extract other features such as text coordinate information, font, font size, etc. Therefore, the language model is not effective for our task. We propose the Dynamic Weighting Algorithm derived from the Adaboost algorithm. Adaboost is an algorithm that can upgrade a weak learner to a strong learner. The working mechanism of the algorithm is as follows.
• An initial base learner is trained with the training set with equal weight of each sample.
• The sample weights of the training set are adjusted according to the predicted performance of the learner in the previous round. The algorithm increases the weight of the misclassified samples so that they will receive more attention in the next round of training. The base learner with a low error rate has a high weight, while the base learner with a high error rate has a low weight. Then a new base learner is trained.
This structural classification task needs to divide document content information into 5 categories. The number of samples of each category is quite different. The details of the labeled data set are shown in Table 1. There are relatively more 'content' and 'useless' samples, while there are relatively fewer level-1, level-2, and level-3 heading samples. In the ordinary unweighted Adaboost algorithm, the weight of each category is the same and fixed. As a result, the F1-score of the categories with fewer samples is lower. This has a significant impact on the subsequent structured output: if a heading is not accurately classified, it is difficult to properly distinguish the document content.
To address this problem, we improve the classification algorithm and propose the Dynamic Weighting algorithm. The theoretical formula of the algorithm is as follows:
1. Initialize the sample weight $D_1$.
where $w_{1i}$ is the weight of the $i$-th sample and $N$ is the number of samples.
where $\alpha$ is the category weight value.

Training base learner
For $m = 1, 2, \ldots, M$, repeat the following operations to get $M$ base learners. We use the Decision Tree algorithm as the base learner in this task.
(1) Using the training set with the $m$-th sample weight distribution $D_m$, train the $m$-th base learner $G_m(x)$.
(2) Calculate the classification error rate $e_m$ of $G_m(x)$ on the weighted training data set.
where $I$ is the indicator function of the misclassification event, $G_m(x_i)$ is the prediction of the base learner for sample $x_i$, and $y_i$ is the labeled category.
(3) Calculate the coefficient of $G_m(x)$.
where $e_m$ is the error rate of $G_m(x)$ and $a_m$ is the weighting coefficient of $G_m(x)$; $a_m$ increases as $e_m$ decreases.
(4) Update the weight of the training sample.
where $M$ is the number of learners and $N$ is the number of samples. $Z_m$ is a normalization factor that makes the weights of all samples sum to 1.
(5) Update the weights of the categories.
where $\phi$ is the F1-score calculation function and $\mathrm{tag}_{j,i}$ denotes the samples in the different categories.
According to Equation (4), when the error rate of the base learner $G_m(x)$ satisfies $e_m < 0.5$, we have $a_m > 0$, and $a_m$ increases as $e_m$ decreases. Base learners with a lower classification error rate therefore contribute more to the final ensemble. The algorithm is thus able to adapt to the training error rate of each weak classifier.
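For orientation, the sample-weight steps above match the textbook binary Adaboost formulation, which can be written as follows with $y_i, G_m(x_i) \in \{-1, +1\}$. This is a sketch under that assumption rather than the exact equations of DWSA: the expressions for the category weights $\alpha_j$ (Equations (2) and (7)) are specific to DWSA and are not reproduced here.

```latex
\begin{align*}
D_1 &= (w_{11}, \ldots, w_{1i}, \ldots, w_{1N}), \qquad w_{1i} = \frac{1}{N}, \quad i = 1, \ldots, N \\
e_m &= \sum_{i=1}^{N} w_{mi}\, I\big(G_m(x_i) \neq y_i\big) \\
a_m &= \frac{1}{2} \ln \frac{1 - e_m}{e_m} \\
w_{m+1,i} &= \frac{w_{mi}}{Z_m} \exp\big(-a_m\, y_i\, G_m(x_i)\big), \quad i = 1, \ldots, N \\
Z_m &= \sum_{i=1}^{N} w_{mi} \exp\big(-a_m\, y_i\, G_m(x_i)\big)
\end{align*}
```

In this form, $e_m < 0.5$ gives $(1 - e_m)/e_m > 1$ and hence $a_m > 0$, which is consistent with the remark on Equation (4). The factor $-y_i G_m(x_i)$ equals $+1$ for a misclassified sample and $-1$ for a correctly classified one, so misclassified samples gain weight. For the five-category task a multi-class variant of this update would be needed.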
According to Equations (2) and (7), the category weights are first initialized according to the number of samples carrying each category tag. During training, the weight values are fine-tuned at each iteration with the learning rate $\gamma$, and we set a threshold $\beta$ to reach the optimal weight allocation and the highest comprehensive F1-score: if $e_m > \beta$, then $\alpha_j = \alpha_j + \gamma$. The computational complexity of the dynamic weighting coefficients is $O(N)$. The procedure is given in Algorithm 1. The experimental results and weight coefficients are shown in Section 3.
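A minimal sketch of the threshold rule is given below. It rests on assumptions that are not confirmed by the text: the rule is stated in terms of $e_m$, whereas this sketch substitutes a per-category error defined as one minus the F1-score of that category, so that only weak categories gain weight. The function names, the value of $\beta$, the F1 values, and the way category weights rescale the sample weights are all hypothetical; only $\gamma = 0.1$ is taken from the experiments.

```python
def update_category_weights(alpha, f1_by_category, beta, gamma):
    """Raise the weight of every category whose error exceeds the threshold beta.

    Assumption: the error of a category is taken to be 1 - F1 of that category.
    """
    updated = {}
    for category, weight in alpha.items():
        error = 1.0 - f1_by_category[category]
        updated[category] = weight + gamma if error > beta else weight
    return updated


def apply_category_weights(sample_weights, labels, alpha):
    """Scale each sample weight by its category weight and renormalize to sum to 1."""
    scaled = [w * alpha[y] for w, y in zip(sample_weights, labels)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Hypothetical state after one boosting round.
alpha = {-1: 1.0, 0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
f1_by_category = {-1: 0.92, 0: 0.92, 1: 0.75, 2: 0.89, 3: 0.60}
new_alpha = update_category_weights(alpha, f1_by_category, beta=0.2, gamma=0.1)

sample_weights = [0.25, 0.25, 0.25, 0.25]
labels = [0, 0, 1, 3]
reweighted = apply_category_weights(sample_weights, labels, new_alpha)
print(new_alpha)
print(reweighted)
```

In this example only the two heading categories with low F1-scores receive a larger weight, so their samples carry more weight in the next round while the total weight stays normalized.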
As shown in Figure 4, through the classification, the content of documents is segmented and classified according to its hierarchy. The data processed by the algorithm model is reorganized to generate the output. We save the content under the corresponding heading and store it as the dictionary data type. The key is the heading and the value is the corresponding content. The headings at each level are nested. In content extraction, the content can be located quickly through the nested relation of headings at all levels. For example, in Figure 4, 'Insured liability' is a level-3 heading saved under a level-2 heading, and the content belonging to the heading is shown on the right. When several similar insurance documents need to be compared, DWSA can be used for structural comparison. The example shows the comparison result of three insurance documents (see the appendix file Figure A3 for the Chinese output).
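The nesting of classified sentences into a heading-keyed dictionary can be sketched as follows. The function name `build_outline`, the node layout with `content` and `children` fields, and the sample sentences are assumptions made for illustration; the paper only states that headings are keys, contents are values, and headings at each level are nested.

```python
def build_outline(classified):
    """Nest sentence texts under their headings.

    `classified` is a list of (label, text) pairs in reading order, with labels
    -1 (useless), 0 (content), and 1 to 3 (heading levels).
    """
    root = {"content": [], "children": {}}
    stack = [(0, root)]  # (heading level, node) pairs of the currently open headings
    for label, text in classified:
        if label == -1:
            continue  # useless content is dropped from the structured output
        if label == 0:
            stack[-1][1]["content"].append(text)
            continue
        # Close headings at the same or a deeper level before opening the new one.
        while stack[-1][0] >= label:
            stack.pop()
        node = {"content": [], "children": {}}
        stack[-1][1]["children"][text] = node
        stack.append((label, node))
    return root


# Hypothetical classifier output for a short document.
classified = [
    (-1, "Page 1"),
    (1, "Basic terms"),
    (2, "What we insure"),
    (3, "Insured liability"),
    (0, "We pay the death benefit."),
    (3, "Insured age"),
    (0, "From 18 to 60 years."),
    (2, "Exclusions"),
    (0, "Losses caused by war."),
]
outline = build_outline(classified)
level2 = outline["children"]["Basic terms"]["children"]
print(level2["What we insure"]["children"]["Insured liability"]["content"])
```

With this layout, the content under a given heading is reached by following the heading keys level by level, which is what makes side-by-side comparison of similar documents straightforward.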

Results and Discussion
In the experiments of this paper, 11,300 data samples are collected from insurance companies, and we use about 9040 samples (80%) as the training set, and use the 2260 samples (20%) as the test set. We first process the dataset through data pre-processing and feature engineering. Then, we conduct the classification algorithm comparison in experiments. The experiments in this paper are supervised learning tasks for machine learning and all samples are labeled.

Results of Some Existing Algorithms
The experimental samples are Chinese insurance document contents with features. The feature representation of each sample is high-dimensional, nonlinear, and sparse. For the classification task, the Support Vector Machine (SVM) algorithm, the Xgboost algorithm, the Lightgbm algorithm, the TextCNN algorithm, and the Adaboost algorithm are selected for comparison of classification accuracy and efficiency. The performance of each algorithm is evaluated by the comprehensive accuracy and the classification F1-score of each category. Finally, the results of the algorithms are compared.
Figure 5 shows the total accuracy of each algorithm. The accuracy of the Xgboost algorithm reaches 91.93% and the TextCNN algorithm achieves 91.56%. The TextCNN algorithm is mainly used to extract semantic information from text content, which is not effective in our structural classification task. In addition, due to the imbalance of samples, the Macro F1-score of the TextCNN algorithm is low. The SVM algorithm has a lower accuracy of 89.88%. The accuracy of the Lightgbm algorithm is 92.28%, and that of the Adaboost algorithm is 91.32%. In the second stage of the experiment, we modify the Adaboost algorithm and add dynamic weight calculation; consequently, the accuracy of the DWSA algorithm increases to 94.68%.
We also conduct a statistical comparison between the proposed method and the other algorithms. As shown in Table 2, we calculate the p-value and apply the Bonferroni correction. The significance level k is 0.05 and a total of 15 groups are compared in pairs. Therefore, after the Bonferroni correction, the significance level is k/15 ≈ 0.0033.
To test the generalization ability of the proposed classifier in DWSA in other fields, we use the public data sets Baike2018qa [42] and Iris [43] to conduct the comparison.
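The arithmetic behind the Bonferroni correction in the statistical comparison can be checked with a short sketch. It assumes that the 15 groups are all unordered pairs of the six algorithms compared in this section, which is consistent with the count given in the text but is not stated explicitly.

```python
from itertools import combinations

# The six algorithms compared in this section (five baselines plus DWSA).
algorithms = ["SVM", "Xgboost", "Lightgbm", "TextCNN", "Adaboost", "DWSA"]
pairs = list(combinations(algorithms, 2))

k = 0.05                      # family-wise significance level
k_corrected = k / len(pairs)  # Bonferroni: divide by the number of comparisons

print(len(pairs))             # 15
print(round(k_corrected, 4))  # 0.0033
```

A pairwise difference is then considered significant only if its p-value falls below the corrected level.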
Please note that, unlike the well-studied field of text semantic analysis and text classification, published work on document structural analysis is still scarce, and public data sets similar to insurance documents that contain rich hierarchical information are hard to find. The Baike2018qa data set contains Chinese question-and-answer texts in various fields, which can be used for classification at sentence level. However, it only has semantic information, without other features such as font size, font, coordinate position, and context spacing. We select a subset of the data set with five categories ('health', 'game', 'life', 'education', and 'entertainment') and unbalanced samples. The Iris data set contains 3 categories, with each sample containing 4 features, which can be used for classification. The result is shown in Figure 6. The clearly better performance on our collected insurance data set than on the public data sets can be attributed to the effective data pre-processing and feature engineering in our framework.
In addition to the comprehensive accuracy, we also introduce the Macro F1-score and the classification F1-score of each category. In our experiment, there are more samples of body content(0) and useless content(-1), fewer samples of level-1 heading(1) and level-2 heading(2), and even fewer samples of level-3 heading(3). Therefore, the comprehensive accuracy sometimes cannot reflect the real classification results well, and the Macro F1-score and the per-category F1-scores help to better analyse and improve the performance of the algorithm. As shown in Table 3, the performance of the algorithms differs slightly, but all of them perform better on body content(0) and useless content(-1) because of the large amount of sample data. Among them, the Adaboost algorithm reaches an F1-score of 92% on these two categories. For level-1 heading(1) and level-3 heading(3), the F1-score of the algorithms is lower than that of body content(0). Therefore, we improve it in the second stage experiment.
Table 3. Experimental results on the insurance data set, including precision, recall, and F1-score. The SVM algorithm, the Xgboost algorithm, the Lightgbm algorithm, the TextCNN algorithm, and the Adaboost algorithm are selected for comparison. The bold numbers are the best results.
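To illustrate why accuracy alone can hide poor performance on rare categories, the sketch below computes accuracy, per-category F1-score, and Macro F1-score on a small made-up example. The function names and the toy labels are assumptions for illustration and are unrelated to the reported results.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return the F1-score of every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1-scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy imbalanced example: eight 'content' sentences and two level-3 headings,
# one of which is misclassified as content.
y_true = [0] * 8 + [3, 3]
y_pred = [0] * 8 + [0, 3]

print(accuracy(y_true, y_pred))
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

In this toy case a single error on the rare category barely moves the accuracy but pulls its F1-score, and therefore the Macro F1-score, down noticeably.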

The DWSA Experimental Result
In the second stage experiment of this paper, based on the previous results, the proposed DWSA model is used.
We add the dynamic weight algorithm (the theory in Section 2) to the existing algorithm and give level-1 heading(1) and level-3 heading(3) more weight. The calculated weight of each category is: $\alpha_{-1} = 1.1$, $\alpha_0 = 1$, $\alpha_1 = 1.4$, $\alpha_2 = 1.2$, $\alpha_3 = 2.8$. In the second stage experiments, different weight values are obtained according to the sample numbers and feature weights of each category. Following the principle of iteratively updating parameters with the algorithm, the weights of the training samples are dynamically updated together with the weights of each classification category to obtain better training results. The learning rate $\gamma$ is 0.1. The experimental results are shown in Table 3 and Figure 7.
Figure 7. For the level-3 heading(3), the F1-score is increased by 6.27%. The F1-score of the level-2 heading(2) is improved by 5.49% and reaches 94.99%. The weights of useless content(-1) and body content(0) are reduced relatively, but their F1-scores still increase by 0.32% and 4%, reaching 92.8% and 96.5%, respectively.
It can be seen that DWSA has improved significantly in terms of comprehensive accuracy and F1-score.
According to the data shown in Figures 5 and 7 and Table 3, the accuracy of the DWSA algorithm is improved from 91.32% to 94.68%. In terms of the F1-score of each category, the level-1 heading(1) and the level-3 heading(3) have the most obvious improvements, with increases of 10.72% and 6.27%, respectively. The F1-score of the level-2 heading(2) improves by a smaller margin of 5.49% because of its high base value (89.5%); the final F1-score reaches 94.99%. In the DWSA algorithm, the weights of useless content(-1) and body content(0) are reduced relatively, but their F1-scores still increase by 0.32% and 4%, reaching 92.8% and 96.5%, respectively. The results show that, following the principle of iterative updating, the dynamic weight algorithm can flexibly adjust the category weights. Even if its weight is reduced, a category with a large number of samples eventually gets a reasonable weight and keeps its F1-score stable or slightly improved. Therefore, the results still show good performance.
Also, as shown in Figures 8 and 9, we visualize the feature importance and plot the ROC (Receiver Operating Characteristic) curves of the Adaboost algorithm and the DWSA algorithm. By comparing the ROC curves, we can see that the improved DWSA algorithm has better performance. The DWSA algorithm has a larger AUC (Area Under Curve) and a smoother ROC curve.
Figure 8. The feature importance of data. F0-F16 respectively represent the features, including size, count, content, font-family, top, left, width, page, height, company, pre-left, pre-size, part, pre-font, total-part, pre-top, and pre-behind. The feature importance of 'page', 'part', and 'total-part' is too low to show in the figure; 'size' means the font size, 'count' means the word and punctuation count, 'content' means the content in documents, 'font-family' means the font, 'company' means the insurance company, 'width' means the total width of the sentence, 'page' means the page number, 'height' means the total height of the sentence, 'part' means the order of the sentence in a paragraph, 'total-part' means the number of sentences in a paragraph, 'top' and 'left' mean the coordinate position of each sentence in the page, and 'pre-left', 'pre-size', 'pre-font', and 'pre-top' are the corresponding features of the previous sentence ('pre-left', 'pre-top', and so on are just the names we give the features). 'pre-behind' means the distance from page bottom of the previous sentence. The previous sentence features are incorporated into context feature fusion.

Experimental Summary
In this experiment, the algorithms commonly used in data mining and classification are compared and tested first.
In the second stage experiment, the DWSA algorithm is tested. By assigning the dynamic weight value, the performance of the algorithm is significantly improved in terms of the comprehensive accuracy and the classification F1-score of each category.
Finally, the data processed by the DWSA framework is reorganized and outputted as structured information.

Conclusions
In this paper, a framework named DWSA for intelligent structural analysis of Chinese insurance documents is proposed. We use the techniques of data pre-processing, feature engineering, and structural classification to process the documents. The proposed framework provides a convenient platform for insurance practitioners and related researchers to survey insurance documents. Also, the DWSA can effectively address the problem of sample imbalance by the dynamic sample weighting algorithm. Verified by experiments, the DWSA algorithm can significantly improve the comprehensive accuracy and the classification F1-score of each hierarchical category. The comprehensive accuracy reaches 94.68% (3.36% absolute improvement) and the Macro F1-score achieves 88.29% (5.1% absolute improvement).
Figure A2. The file format in the data pre-processing step (in Chinese). We modify the Pdfplumber algorithm framework for PDF document information conversion, which can get more features in document content. The features mainly include size, font family, top, company, etc. The features named top, left, width, and height mean the context relative location information.
Figure A3. Structured document output (Chinese output). Document information is extracted for structural comparison.