Patent Automatic Classification Based on Symmetric Hierarchical Convolution Neural Network

With the rapid growth of patent applications, automatically classifying accepted patent application documents accurately and quickly has become an urgent problem. Most previous studies on patent automatic classification are based on feature engineering and traditional machine learning methods such as SVM, and some even rely on the knowledge of domain experts; hence they suffer from low accuracy and poor generalization ability. In this paper, we propose a patent automatic classification method based on a symmetric hierarchical convolution neural network (CNN), named PAC-HCNN. We use the title and abstract of the patent as the input data, and then segment the input and vectorize it with the word embedding technique. We then design a symmetric hierarchical CNN framework to classify the patents based on the word embeddings, which is much more efficient than traditional RNN models for text while preserving both the history and future information of the input sequence. We also add gated linear units (GLUs) and residual connections to help realize the deep CNN. Additionally, we equip our model with a self-attention mechanism to address the long-term dependency problem. Experiments are performed on large-scale datasets for Chinese short text patent classification. The experimental results demonstrate the effectiveness of the proposed model, which performs significantly and consistently better than other state-of-the-art models on both fine-grained and coarse-grained classification.


Introduction
As an effective technology carrier, patent documents have been widely used in real-life applications. According to the World Intellectual Property Organization's (WIPO) report on intellectual property statistics, the number of global patent applications is rapidly increasing [1]. With such a dramatic growth of patent applications, relying on manual work to complete patent classification takes a lot of time and money, and is thus unable to meet real needs. Therefore, many patent automatic classification (PAC) algorithms have been proposed. Some works focus on patent text representations; e.g., in order to identify a technical patent topic from a patent abstract, the authors of [2] combined semantic role labeling (SRL) information with framework rules of semantics, which improves the performance of SRL systems by obtaining knowledge from patents. Some works focus on the extraction of semantic features from patent texts; e.g., the authors of [3] extracted keywords related to technical information from patent documents, constructed keyword collections, and obtained the layer of keywords according to the level of information. Additionally, machine learning has drawn more attention for PAC systems; e.g., the authors of [4] developed an automatic patent quality analysis and classification system, SOM-KPCA-SVM, using data mining methods to identify and classify the quality of new patents in a timely manner. However, most of these studies are highly dependent on feature engineering and are therefore unable to quickly and automatically classify large-scale patent datasets tagged by international patent classification (IPC) numbers into 'sections' and 'classes'. Moreover, with the gradual refinement of patent categories and the increase in the number of categories, the classification accuracy of these PAC algorithms drops noticeably.
Here we first introduce the detailed information of a patent. As shown in Figure 1, a patent document is a special kind of text that contains the invention name, applicant, inventor, abstract, claims, classification number, and other contents. Apart from the applicant and inventor, these contents usually show strong characteristics of domain knowledge and are therefore suitable for constructing features for PAC algorithms. The classification efficiency can thus be effectively improved by selecting appropriate features. In Figure 1, the purple box represents the title of the patent, the green box represents the abstract, the blue box represents the principal claim, the red box represents the main classification number, and the brown box represents the patent classification numbers (all classification numbers). This paper mainly studies the fine-grained classification of the main classification number, rather than analyzing all patent classification numbers. If the classifier is trained on the whole content of the patent documents, training becomes too time-consuming and the model much more complex. Therefore, a balance between accuracy and efficiency is needed in practical applications. Our analysis of patent text shows that an automatic patent classifier can be trained on a few representative parts of the patent documents, which improves the efficiency of PAC while maintaining its accuracy. Relevant research results show that satisfactory accuracy can be achieved by using the patent abstract [6] or claims [7] in patent classification tasks. Therefore, this paper uses the titles and abstracts of patent documents to form the input data for training, and uses the corresponding 'section' or 'class' labels to construct the target samples for PAC.
Specifically, to realize fast PAC, we use the word embedding technique to pre-process the input text. We first adopt the jieba Chinese segmentation model (https://github.com/fxsjy/jieba) to segment the text data, and then apply the gensim [8] framework to project the words into low-dimensional vectors. We then introduce a symmetric hierarchical convolution neural network (CNN) framework to encode the word embeddings for classification. A conventional CNN can only capture a fixed-size context; our hierarchical CNN, however, can easily enlarge the context by stacking CNN layers on top of each other symmetrically from the head and tail of the input. Stacking CNN layers symmetrically creates hierarchical representations of the input: lower layers process nearby words in the sequence while higher layers process distant words [9], and the symmetric structure enables the proposed model to preserve both the history and future information. We equip our framework with gated linear units (GLUs) [10] and residual connections [11] to realize a deep CNN architecture. We also use the self-attention mechanism to address the long-term dependency problem that arises with long input sequences, which helps our model focus on the parts of the sequence that are significant for classification.
The main contributions of this study are as follows:
1. To improve the efficiency of automatic patent classification, the model is trained by combining the short text of patent name and abstract.
2. After pre-processing the input data by word embedding method, for further classification, we introduce a hierarchical CNN framework to create the representations of the input text.
3. We conduct the PAC experiments in five 'sections' and 50 'classes' on large-scale real Chinese patent datasets. The experiment results verify that our model based on hierarchical CNN outperforms other advanced alternatives consistently and statistically significantly on both fine-grained and coarse-grained classification.

International Patent Classification System
The International Patent Classification (IPC) is a complete patent classification system [12] developed by the WIPO. Since its publication in the last century, it has made outstanding contributions to the management of global patent documents. Retrieval by IPC number does not rely on keywords, and it allows patent documents from different countries to receive a unified classification number. Currently, the IPC is available in 12 languages, including Chinese, English, and German. The IPC has a hierarchical structure consisting of section, class, subclass, group, and subgroup. At each sub-level of the hierarchy, the number of categories is multiplied by about 10, so the IPC contains approximately 72,000 categories in total [13]. China's IPC numbers differ slightly from the international IPC numbers. According to the technical field, patents are divided into eight sections, from A: Human necessities to H: Electricity. Taking the Chinese patent IPC number G06F11/00 in Figure 1 as an example, G, G06, G06F and G06F11/00 represent the corresponding section, class, subclass and group, respectively. Most works try to automatically classify specific sections and classes of the IPC, whereas research on classifying all sections and classes is still limited.
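The hierarchy described above can be made concrete with a short sketch. The Python snippet below splits an IPC number into its section, class, subclass, and group prefixes, using the G06F11/00 example from the text. The function name `parse_ipc` and the regular expression are illustrative assumptions, not part of the original study, and the pattern only covers the common layout of one letter, two digits, one letter, a main-group number, a slash, and a subgroup number.

```python
import re

# Hypothetical helper (not from the paper): splits an IPC number such as
# "G06F11/00" into the prefixes that name each level of the hierarchy.
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(code: str) -> dict:
    """Return the section, class, subclass and group prefixes of an IPC number."""
    match = _IPC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised IPC number: {code!r}")
    section, class_digits, subclass_letter, main_group, subgroup = match.groups()
    ipc_class = section + class_digits          # e.g. "G06"
    subclass = ipc_class + subclass_letter      # e.g. "G06F"
    return {
        "section": section,
        "class": ipc_class,
        "subclass": subclass,
        "group": f"{subclass}{main_group}/{subgroup}",
    }


print(parse_ipc("G06F11/00"))
```

For section-level (coarse-grained) labels only the `section` entry would be needed, and for class-level (fine-grained) labels the `class` entry.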
The ultimate goal of PAC research is to enable computer systems to find IPC tags for patent applications as accurately as patent examiners do. However, due to the complexity of the IPC system, existing PAC systems are still far from ideal.

PAC Algorithms
To build a PAC system, a personalized classification method oriented to TRIZ (the Russian acronym for the theory of inventive problem solving) was proposed based on a subject model [14]. The task of automatic patent classification can be accomplished by combining different feature items with machine learning algorithms and using feature dimension reduction to optimize the classifier. Grouping similar patents according to the invention principles of TRIZ users was proposed to solve the problems of multi-label classification and class imbalance [15]. An automatic patent classification method was proposed based on the integration of multiple features and multiple classifiers [16]. Different dictionaries and paragraph vector features are used to train Bayesian and support vector machine (SVM) classifiers respectively, and the final patent classification is completed with the feature-category matrix. M3-SVM attempted to decompose the problem of large-scale unbalanced patent classification into a set of relatively small and more balanced sub-problems, which are then handled by a min-max modular support vector machine [17]. Later, an artificial intelligence-assisted patent decision-making method was proposed [18], in which a patent classification system was developed by combining expert screening with a hybrid genetic support vector machine (HGA-SVM) model.
In recent years, with the improvement of hardware performance and the popularity of deep learning algorithms, some studies have applied CNNs to text categorization tasks. A CNN model was used to classify short text sentences [19]. Considering that short text often suffers from data sparsity and ambiguity in its representation due to the lack of context, a short text classification method combining semantic clustering and CNN was proposed [20]. This method introduces a fast semantic clustering algorithm through the embedding layer and uses supervised patterns to detect multi-scale semantic units with the help of external knowledge. Because text categorization is complex and time-consuming, Fasttext, Facebook's open-source method for sentence categorization and word feature learning, was applied to Chinese text categorization and its effectiveness was verified [21]. Compared with other widely used text classification methods, Fasttext can greatly reduce the computing time while maintaining the classification results.
However, CNN-based models such as Fasttext cannot comprehend text content as effectively as a recurrent neural network (RNN), which processes a text word by word [22]. RNNs, in turn, still suffer from an efficiency problem, so we propose a hierarchical CNN to balance the efficiency and effectiveness of the model.

CNN Framework
CNNs have been widely used in areas such as speech recognition, image processing, and natural language processing. We construct a hierarchical CNN model, PAC-HCNN, on the TensorFlow platform [23]. The whole PAC-HCNN framework consists of six layers: input layer, pre-processing layer, word embedding layer, hierarchical convolution layer, fully connected layer, and output layer. An overview of the proposed framework is shown in Figure 2. As shown in Figure 2, we first pre-process the original text content at the input level. We use the jieba segmentation model to segment the input text into separate words or phrases and remove the stop words. We then adopt the gensim framework to project these words or phrases into low-dimensional vectors; that is, the word embedding layer generates the representation of the valid words after pre-processing. We describe the pre-processing procedure in detail in Section 4.2. We denote the $i$-th valid word in the input content as $x_i$. Assuming that the input content contains $m$ valid words, the entire input sequence is $x = (x_1, \ldots, x_m)$.
Before applying the convolutional layers, we add position embeddings to the above word embeddings. For the input sequence $x = (x_1, \ldots, x_m)$, we first embed it as low-dimensional vectors $u = (u_1, \ldots, u_m)$, where $u_j \in \mathbb{R}^f$ and $f$ is the dimension of the vectors. We then embed the absolute positions of the input elements as $p = (p_1, \ldots, p_m)$, where $p_j \in \mathbb{R}^f$. This makes our model aware of which part of the input it is currently processing. We combine both embeddings to construct the final input embeddings $e = (u_1 + p_1, \ldots, u_m + p_m)$.
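The combination of word and position embeddings is an element-wise sum, which the following minimal Python sketch illustrates. The function name `add_position_embeddings` and the toy values are assumptions made for illustration; in the actual model both embedding tables would be learned or pre-trained rather than written by hand.

```python
from typing import List

Vector = List[float]


def add_position_embeddings(word_vecs: List[Vector], pos_vecs: List[Vector]) -> List[Vector]:
    """Compute e_j = u_j + p_j for every position j of the input sequence."""
    if len(word_vecs) > len(pos_vecs):
        raise ValueError("sequence is longer than the position table")
    combined = []
    for u_j, p_j in zip(word_vecs, pos_vecs):
        if len(u_j) != len(p_j):
            raise ValueError("word and position embeddings must share dimension f")
        combined.append([a + b for a, b in zip(u_j, p_j)])
    return combined


# Toy values with m = 3 words and f = 2 dimensions (made up for illustration).
u = [[0.5, 1.0], [0.0, 2.0], [1.5, 0.5]]
p = [[0.25, 0.0], [0.5, 0.0], [0.75, 0.0]]
e = add_position_embeddings(u, p)
print(e)
```

Because the sum keeps the dimension $f$ unchanged, the convolution layers can consume $e$ exactly as they would consume the word embeddings alone.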
In the convolution layer, we introduce a symmetric hierarchical structure to flexibly encode the input embeddings. We denote the output of the $l$-th layer as $h^l = (h^l_1, \ldots, h^l_m)$. Each layer consists of a one-dimensional convolution followed by a non-linearity. With a kernel of width $k$, each output $h^1_i$ of the first layer covers $k$ elements of the input. By stacking blocks on top of each other symmetrically from the head and tail of the input sequence, the number of input elements covered by each output is enlarged. For instance, if we stack six blocks with a kernel width $k$ of 5, each output covers 25 elements of the input. The non-linearities allow the model to attend to only a few elements or to the whole sequence as needed. Unlike the chain structure of an RNN, our CNN framework with multiple symmetric layers can represent a long-range sequence in parallel while keeping the history and future information simultaneously. Specifically, to capture relationships within a sequence of $n$ elements, our symmetric hierarchical CNN needs $O(n/k)$ convolutional operations, while an RNN needs $O(n)$ operations. Additionally, in our CNN-based framework the input passes through a fixed number of kernels and non-linearities, whereas an RNN applies up to $n$ operations to the first word and only a single set of operations to the last word [24]. Compared with a traditional CNN, our symmetric hierarchical framework is clearly less efficient, since it stacks several layers whereas a traditional CNN processes a sequence with only one layer. Our framework therefore strikes a balance between efficiency and effectiveness.
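The figure of 25 input elements for six blocks of kernel width 5 follows from the standard receptive-field arithmetic for stacked convolutions. The sketch below assumes stride 1 and no dilation, which the text does not state explicitly; the function names are illustrative.

```python
def receptive_field(num_layers: int, kernel_width: int) -> int:
    """Input elements seen by one output after stacking stride-1 convolutions."""
    if num_layers < 0 or kernel_width < 1:
        raise ValueError("need num_layers >= 0 and kernel_width >= 1")
    # Each extra layer widens the view by (kernel_width - 1) elements.
    return 1 + num_layers * (kernel_width - 1)


def layers_needed(span: int, kernel_width: int) -> int:
    """Smallest number of stacked layers whose receptive field covers `span` elements."""
    if span < 1 or kernel_width < 2:
        raise ValueError("need span >= 1 and kernel_width >= 2")
    # Integer ceiling of (span - 1) / (kernel_width - 1).
    return -(-(span - 1) // (kernel_width - 1))


print(receptive_field(6, 5))   # six blocks, kernel width 5
print(layers_needed(25, 5))
```

The second function shows why the depth required to relate $n$ elements grows roughly as $n/k$ rather than $n$: every additional layer extends the covered span by $k - 1$ elements at once.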
In each convolution kernel, the parameters are $W \in \mathbb{R}^{2f \times of}$ and $b_w \in \mathbb{R}^{2f}$. $X \in \mathbb{R}^{o \times f}$ denotes the input, in which $o$ is the number of input elements and $f$ is the dimension. Mapping the input through the convolution layer yields the output $Y \in \mathbb{R}^{2f}$; note that the dimension of the output is twice that of the input. Afterwards, our model feeds the $o$ output elements to the following layers. For the non-linearities, we apply GLUs [10] to the output $Y = [A\ B]$, denoted as
$$\mathrm{GLU}([A\ B]) = A \otimes \sigma(B)$$
where $A, B \in \mathbb{R}^f$ are the inputs of the GLU and $\otimes$ represents element-wise multiplication. The gate $\sigma(B)$ decides which parts of the input $A$ are passed on in the current context. Residual connections [11] are also adopted to form a deeper CNN structure by adding the input of each CNN layer to its output.
We then leverage a self-attention mechanism to further encode the text, as such a mechanism can handle the long-term dependency problem by assigning higher weights to the words that are important for the final classification. Specifically, to calculate the attention in each layer, given $h^l = \{h^l_1, h^l_2, \ldots, h^l_n\}$, we first compute a relativity score between every pair of input elements. The attention weight between every pair of elements is then computed from these scores, and the final representation vector of each word is obtained using these attention weights.
After all the operations above, the final probability of a patent belonging to a class is generated from the output of the final layer $L$ by the fully connected layer, where $W_f$ denotes the weights of the fully connected layer and $b_f$ is the bias.
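To make the layer described above more tangible, the following pure-Python sketch chains a one-dimensional convolution, a GLU, a residual connection, and a self-attention step. Several details are assumptions made only for illustration and are not confirmed by the text: the kernel width is odd, the sequence is zero-padded so its length is preserved, the relativity score is a plain dot product, and the attention weights come from a softmax over those scores. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def glu(y: Vector) -> Vector:
    """Gated linear unit: split y = [A B] in half and return A * sigmoid(B)."""
    if len(y) % 2 != 0:
        raise ValueError("GLU input must have even length 2f")
    f = len(y) // 2
    return [a * sigmoid(b) for a, b in zip(y[:f], y[f:])]


def conv_glu_residual(h_prev: Matrix, weight: Matrix, bias: Vector, k: int) -> Matrix:
    """One layer: width-k convolution (zero padding, assumed), GLU, residual add.

    weight has shape (2f, k*f) and bias has length 2f.
    """
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel width")
    f = len(h_prev[0])
    pad = [[0.0] * f for _ in range(k // 2)]
    padded = pad + h_prev + pad
    out = []
    for i, h_i in enumerate(h_prev):
        # Flatten the k neighbouring vectors into one window of length k * f.
        window = [x for vec in padded[i:i + k] for x in vec]
        y = [sum(w * x for w, x in zip(row, window)) + b
             for row, b in zip(weight, bias)]
        out.append([g + h for g, h in zip(glu(y), h_i)])  # residual connection
    return out


def softmax(scores: Vector) -> Vector:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(h: Matrix) -> Matrix:
    """Dot-product scores and softmax weights (both assumed), then a weighted sum."""
    f = len(h[0])
    out = []
    for h_i in h:
        scores = [sum(a * b for a, b in zip(h_i, h_j)) for h_j in h]
        weights = softmax(scores)
        out.append([sum(w * h_j[d] for w, h_j in zip(weights, h)) for d in range(f)])
    return out
```

With an all-zero kernel the gated branch contributes nothing, so the layer returns its input unchanged. This is the behaviour that makes residual connections convenient for stacking many layers.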

Data Collection
We have two datasets: (1) a published small-scale dataset of Chinese patent documents with more than 3000 original Chinese patents; (2) data crawled from the published Chinese National Knowledge Infrastructure (CNKI) patent database (http://cnki.scstl.org/kns55/brief/result.aspx?dbPrefix=SCPD) using the open source Python web crawler framework Scrapy [25], with about 48,000 original patent documents collected in total. The entire dataset contains about 51,000 patent documents. After the data were deduplicated, denoised, and restricted to sections D/E/F/G/H, 29,457 valid patent documents were obtained. In our experiments, these 29,457 documents were divided into 80% for training, 15% for validation, and 5% for independent testing.
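The 80/15/5 division can be sketched as follows. The shuffling, the fixed seed, and the rounding rule (truncate the training and validation sizes and give the remainder to the test set) are assumptions; the text does not say how the split was drawn or whether it was stratified by label, so the sizes printed here need not match the reported ones exactly.

```python
import random
from typing import Iterable, List, Tuple


def split_dataset(items: Iterable, train_frac: float = 0.80, val_frac: float = 0.15,
                  seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle and split into train/validation/test; the test set takes the remainder."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Document indices stand in for the 29,457 valid patent documents.
train, val, test = split_dataset(range(29457))
print(len(train), len(val), len(test))
```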

Data Pre-Processing
The eight types of patent documents, from section A to section H, are automatically tagged together with the original input content. The automatic tagging process is carried out in three ways:
1. The first way uses patent names as the original input for classification. We use regular expressions to directly extract the name and the corresponding classification number from a patent document, forming name-section (e.g., H) tag data.
2. The second way uses patent abstracts as the original input for classification. We use regular expressions to directly extract the abstract and the corresponding classification number from a patent document, forming abstract-section (e.g., H) tag data.
3. The third way uses patent names and abstracts together as the original input for classification. We use regular expressions to directly extract the name, abstract, and the corresponding classification number from a patent document, thereby forming name and abstract-section (e.g., H)/class (e.g., H01) tag data.
Based on the above processing steps, we obtain the complete tagged corpus. Before training the model, the original text needs to be pre-processed. The pre-processing mainly consists of the following three steps:
1. Firstly, the original input text is segmented into words. In the experiments, word segmentation is performed with the Python version of the jieba word segmentation module.
2. Secondly, common stop words collected from the Internet and high-frequency words with a low degree of discrimination in the patent literature are combined to form a stop word list, which is used to remove stop words after word segmentation and thereby reduce the feature dimension.
3. Thirdly, a vectorized representation of the words is obtained with word embedding techniques. In the experiments, the open source gensim [8] word vector training tool is used to pre-train on all the experimental datasets and obtain the corresponding embeddings.
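The first two pre-processing steps can be sketched in a few lines of Python. The segmenter is passed in as a function so that the example runs without third-party packages; with jieba installed, `jieba.lcut` could be supplied in its place. The helper names and the toy stop word lists are assumptions for illustration, and the embedding step with gensim is deliberately left out because the text does not specify its configuration.

```python
from typing import Callable, Iterable, List, Set


def build_stop_words(general: Iterable[str], domain: Iterable[str]) -> Set[str]:
    """Merge a general stop word list with high-frequency, low-discrimination patent words."""
    return set(general) | set(domain)


def preprocess(text: str, segment: Callable[[str], Iterable[str]],
               stop_words: Set[str]) -> List[str]:
    """Segment the text, then drop stop words and empty tokens."""
    tokens = (token.strip() for token in segment(text))
    return [token for token in tokens if token and token not in stop_words]


def whitespace_segment(text: str) -> List[str]:
    """Stand-in segmenter for this demo only; Chinese text needs a real segmenter."""
    return text.split()


# Toy stop word lists (made up for illustration).
stop_words = build_stop_words(["the", "a", "of"], ["method", "device"])
tokens = preprocess("a method of cooling the battery pack", whitespace_segment, stop_words)
print(tokens)
```

The resulting token lists are what a word vector training tool would consume in the third step.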

Datasets
According to Section 4.1, we have 29,457 valid patent documents containing D/E/F/G/H labels in total. According to the pre-processing step in Section 4.2, we construct 4 sub-datasets for experiments.

1. Dataset_name is used to examine the performance of algorithms on section classification which only use the patent names as input.
2. Dataset_abstract is used to examine the performance of algorithms on section classification that only use patent abstracts as input.
3. S_Dataset_name+abstract is used to examine the performance of algorithms on section classification that combines the patent name and abstract as input.
4. C_Dataset_name+abstract is used to examine the performance of algorithms on class classification that combines the patent name and abstract as input.
The basic information of the four datasets is shown in Table 1.
In Table 1, the terms '#Train', '#Val', '#Test' and '#Label' represent the number of documents in training set, the number of documents in validation set, the number of documents in test set and the number of label tags, respectively.
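How one tagged record turns into a sample of each of the four sub-datasets can be sketched as below. The record layout (the keys `name`, `abstract`, and `main_ipc`), the space used to join name and abstract, and the example record are all hypothetical; the label is read off the main classification number, with the first character as the section and the first three characters as the class.

```python
from typing import Dict, Tuple

# Hypothetical record layout: the field names are assumptions for illustration.
Record = Dict[str, str]


def make_sample(record: Record, dataset: str) -> Tuple[str, str]:
    """Return (input_text, label) for one of the four sub-datasets."""
    main_ipc = record["main_ipc"]
    section, ipc_class = main_ipc[0], main_ipc[:3]
    if dataset == "Dataset_name":
        return record["name"], section
    if dataset == "Dataset_abstract":
        return record["abstract"], section
    combined = record["name"] + " " + record["abstract"]
    if dataset == "S_Dataset_name+abstract":
        return combined, section
    if dataset == "C_Dataset_name+abstract":
        return combined, ipc_class
    raise ValueError(f"unknown dataset: {dataset}")


# Made-up record, not taken from the experimental data.
example = {
    "name": "Memory error detection",
    "abstract": "A method for detecting memory errors.",
    "main_ipc": "G06F11/00",
}
print(make_sample(example, "C_Dataset_name+abstract"))
```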

Evaluation Metrics
We use Precision, Recall, and F-measure ($F_1$) to evaluate the performance of the patent classification algorithms. Precision is the proportion of documents assigned to a category that truly belong to it, and Recall is the proportion of documents in a category that are correctly assigned to it. Precision and Recall are two widely used metrics in information retrieval. To evaluate an algorithm's performance comprehensively, we also adopt the F-measure, which is the weighted harmonic mean of Precision and Recall; $F_1$ weights the two equally and is defined as follows:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
A higher $F_1$ indicates better performance. As the number of experimental samples grows and the distribution of the dataset becomes more balanced, the credibility of the evaluation indicators increases correspondingly. Finally, by averaging each indicator, we obtain the overall performance of the classification algorithm on the entire sampled dataset.
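The metrics above can be computed per category and then averaged, as in the following sketch. It assumes that "the average of each indicator" means an unweighted (macro) average over categories, which the text does not spell out; the function names and the toy labels are illustrative.

```python
from typing import Dict, Sequence, Tuple

Scores = Dict[str, Tuple[float, float, float]]


def per_class_prf(y_true: Sequence[str], y_pred: Sequence[str]) -> Scores:
    """Precision, Recall and F1 for every label seen in y_true or y_pred."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores: Scores) -> Tuple[float, float, float]:
    """Unweighted mean of each indicator over all categories (assumed averaging rule)."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy section labels, made up for illustration.
y_true = ["G", "G", "H", "H"]
y_pred = ["G", "H", "H", "H"]
scores = per_class_prf(y_true, y_pred)
print(scores)
print(macro_average(scores))
```

A macro average treats every category equally, so a small, poorly classified category pulls the overall score down as much as a large one would.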

Experimental Results on Coarse-Grained Classification
In order to examine which part of the patent document is the most valuable for patent classification, experiments are performed with the PAC-HCNN model on three different datasets: Dataset_name, Dataset_abstract and S_Dataset_name+abstract. Because we only classify the patent section, which has only 5 targets, this is a coarse-grained classification. The hyper-parameters we use are shown in Table 2. The model is then tested on the corresponding test sets. The detailed results are shown in Table 3. From Table 3, we observe that using only the patent name gives the worst coarse-grained classification performance: the average P value is 80.3%, while the average R and $F_1$ values are 82.1% and 81.2%, respectively. Using only the patent abstract performs slightly better: the average P, R and $F_1$ values are 83.7%, 84.1% and 83.9%, respectively. Using the combination of the name and abstract gives the best result, with the average P, R and $F_1$ values reaching 85.7%, 87.8% and 86.7%. This demonstrates that considering more information from the patent correspondingly improves the experimental performance.

Comparing with Other Models
In order to comprehensively examine the classification performance of PAC-HCNN, we compare it with seven state-of-the-art algorithms: Fasttext, gated recurrent unit (GRU), multinomial naive Bayesian (MNB), logistic regression (LR), support vector machines (SVM), random forest (RF) and multi-layer perceptron neural networks (MLPNN). Among the models using a conventional CNN, we only adopt Fasttext, as it achieved the best results reported in its original paper. Except for PAC-HCNN, Fasttext and GRU, the models are implemented using the corresponding algorithms in the Scikit-learn [26] toolkit. We conduct both coarse-grained and fine-grained classification. For fine-grained classification, we use C_Dataset_name+abstract, in which the patent class has 50 category labels. In all the experiments, we use the average $F_1$ value as a comprehensive evaluation metric. We also apply a paired two-tailed t test with p < 0.05 between our model and the other models. The experimental results are shown in Figures 3 and 4. From Figure 3, we observe that the deep learning methods PAC-HCNN, Fasttext and GRU achieve better results than the traditional machine learning methods. GRU outperforms Fasttext, which indicates that an RNN-based model can comprehend the patent text better than a traditional CNN-based model. However, PAC-HCNN obtains the best result, which we attribute to the symmetric hierarchical CNN, whose ability to represent patent text is even better than that of GRU. Figure 4 shows the same tendency, which further supports the effectiveness of PAC-HCNN. Specifically, we find that the margin between PAC-HCNN and the other models is larger on fine-grained classification. For example, PAC-HCNN is 2.8% better than GRU on coarse-grained classification, while it is 6.4% better on fine-grained classification. This suggests that our hierarchical architecture is particularly suitable for the fine-grained task.
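The paired two-tailed t test mentioned above compares two models on matched measurements. The sketch below computes the test statistic and the degrees of freedom with the standard library only; what exactly is paired (for example, scores from repeated runs or from individual categories) is not stated in the text, so the inputs are left to the caller and the function name is an assumption.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(scores_a: Sequence[float],
                       scores_b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for a paired t test on matched scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long sequences with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The difference is significant at the two-tailed 0.05 level when the absolute value of `t` exceeds the critical value of the t distribution for the returned degrees of freedom, for example 2.776 for four degrees of freedom.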
In conclusion, PAC-HCNN outperforms other alternatives consistently and statistically significantly on both coarse-grained and fine-grained tasks.

Performance Variation on Different Sections
We conduct an additional experiment to further examine the performance of PAC-HCNN on different sections. The results are shown in Figure 5, from which we observe that the classification performance of our model differs considerably across sections, with section D being the worst. This is probably caused by the imbalanced distribution of the sampled data across the sections of the experimental dataset.

Computation Cost
We evaluate the computation cost of our model, GRU and Fasttext, as GRU and Fasttext achieved comparable results that are much better than those of the other alternatives. We extract a subset of C_Dataset_name+abstract with 100 samples. The device we use to evaluate efficiency is a GTX-1080 GPU. Table 4 presents the experimental results. From Figures 3 and 4 and Table 4, we observe that our model not only outperforms the RNN-based GRU model but is also more efficient. We attribute this to our hierarchical CNN, which processes the input sequence in parallel to improve efficiency while boosting performance through the deep structure of stacked convolutional layers. However, our model is less efficient than Fasttext, although the gap is not very large. This is because our model uses deeper neural networks than Fasttext, achieving much better results at an acceptable additional time cost.

Discussion
Our model has many potential real-life applications. Accurately classifying patents into fine-grained fields could help researchers find the most relevant patents and filter out noise. For example, in medical research, it could offer a shortcut to patents focusing on a certain disease or a certain operation.

Conclusions
In this paper, we propose a symmetric hierarchical CNN framework to realize IPC classification. We first extract the patent name and abstract to form the training datasets and leverage the word embedding technique to transform the text into low-dimensional vectors. We then propose a hierarchical architecture of stacked convolutional layers to represent the input word sequence, which improves efficiency compared with RNN-based models. We also incorporate the GLU mechanism and residual connections into the framework to realize a deep CNN. The self-attention mechanism is also introduced to handle the long-term dependency problem. Finally, we conduct extensive experiments on real-life patent datasets; the experimental results verify the effectiveness and efficiency of our model for IPC number classification.

Conflicts of Interest:
The authors declare no conflict of interest.