On the Potential of Taxonomic Graphs to Improve Applicability and Performance for the Classification of Biomedical Patents

Abstract: A core task in technology management in biomedical engineering and beyond is the classification of patents into domain-specific categories, increasingly automated by machine learning, with the fuzzy language of patents causing particular problems. Striving for higher classification performance, increasingly complex models have been developed, based not only on text but also on a wealth of distinct (meta) data and methods. However, this makes it difficult to access and integrate data and to fuse distinct predictions. Although the already established Cooperative Patent Classification (CPC) offers a plethora of information, it is rarely used in automated patent categorization. Thus, we combine taxonomic and textual information in an ensemble classification system and compare stacking and fixed combination rules as fusion methods. Various classifiers are trained on title/abstract and on both the CPC and IPC (International Patent Classification) assignments of 1230 patents covering six categories of future biomedical innovation. The taxonomies are modeled as tree graphs, parsed, and transformed by Dissimilarity Space Embedding (DSE) to real-valued vectors. The classifier ensemble exceeds the best base performance by nearly 10 points, reaching F1 = 78.7% when stacked with a feed-forward Artificial Neural Network (ANN). Taxonomic base classifiers perform nearly as well as the text-based learners. Moreover, an ensemble of only CPC and IPC learners reaches F1 = 71.2% as a fully language-independent and straightforward approach built on established algorithms and readily available integrated data, enabling new possibilities for technology management. Abbreviations: ANN, Artificial Neural Network; kNN, k-Nearest-Neighbor; LogReg, Logistic Regression.


Introduction
The analysis of patents is one of the core duties of technology and innovation management with varying purposes and perspectives, such as forecasting emerging technologies, assessing performances of regional/national innovation systems, mapping technologies, managing R&D activities, or evaluating the collaboration potential at company or policy level [1,2].
An essential subtask within these processes is classifying patents into coherent groups of similar documents as a basis for further retrieval and assessment. To facilitate such approaches, examiners at the national authorities use official taxonomies to assign patent applications to one or more classes according to their content. To overcome the separation that had existed until then, in 2013 the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO) jointly released the Cooperative Patent Classification (CPC), which comprises about four times as many classes as the previously used International Patent Classification (IPC) [3]. However, in order to assess domain-specific emerging technologies, either the use of concordance tools/tables to map, e.g., medical technologies [4] or a complete reclassification of patents beyond the official taxonomy systems is still indispensable [1].
Facing the huge and still expanding number of worldwide patents [5], the implementation of automated text categorization methods becomes vital to keep up with the rapidly growing body of knowledge. Many approaches have been deployed, differing in aspects such as the Machine Learning (ML) algorithms used, the domains covered, etc. Most of these approaches use the bag-of-words model to transform the text into term frequencies as real-valued features for machine learning (Term Frequency-Inverse Document Frequency, TF-IDF) [6]. In order to improve the classification performance, advanced methods incorporate more and more (meta) data from several distinct sources [7].
However, the resulting increase in complexity raises new difficulties in mastering both varying methods and different data sources, such as data accessibility and integration (deduplication), the number of features in sparse matrices, the need for a method to combine multiple predictions, or the necessary computing power. Surprisingly, only a few studies use the IPC assignments, and even fewer use the CPC assignments, as an additional information source, although the data are readily available jointly with patent texts.
Furthermore, the well-studied fusion of different classifiers promises improved performance, especially when the diversity of the base classifiers is high [8]. A plethora of fusion methods, e.g., rule-based approaches such as summing and averaging, or stacking with machine learning algorithms acting as fusion classifiers, are available to customize ensemble classification systems to specific tasks.
Despite these advantages, patent categorization into user-defined groups via ensemble classifier systems has rarely been studied so far. In particular, the use of the taxonomies mentioned above could be a major step to overcome given limitations, since IPC and CPC represent a graph that provides readily available and extensive information. However, to be used in a classifier ensemble, the graph representation must be transformed into a real-valued vector. Riesen and Bunke [9] offer an outstanding solution for this challenging task with the Dissimilarity Space Embedding (DSE), which provides a dense real-valued vector of graph edit distances.
Thus, we present a novel ensemble classification system based on the combination of text and taxonomic features to enhance both applicability and performance in a real-world set-up, namely the mapping of patents as an important R&D output of biomedical engineering into six fields of future innovation (e.g., telemedicine, imaging, and implants). The central question is to what extent classification performance can be improved without significantly increasing the complexity of the approach.
To achieve this, our approach builds as far as possible on robust technologies and models and on only one source providing integrated data. In detail, title and abstract of biomedical patents are processed according to the bag-of-words model. Additionally, the IPC and especially the much more detailed CPC are transformed by means of DSE, an established method that, to the best of our knowledge, is now applied to patent taxonomies for the first time. Both textual and taxonomic features serve as input to four different machine learning base classifiers to assign patents to six classes of future biomedical innovation. By pairing two base learners applied to diverse features (one text-based and one taxonomy-based), selected according to their differing performances on validated test data (low vs. top performer), and by varying the fusion method systematically between stacking and fixed combination rules, our extensive comparison totals 64 different ensembles. The results show, first, that IPC- and CPC-related learners achieve performances comparable to those using textual features in base classification and, second, that the top taxonomic base classifiers contribute substantially to the overall performance when stacked with equally top-performing base learners on textual features.
The remainder of this paper first analyzes the related work, then describes materials and methods, and finally presents and discusses the results. The conclusions point out the major findings and important future perspectives.

Official Patent Classification Systems
The IPC system with approximately 70,000 classes and its extension, the unified CPC system with about 250,000 classes, are currently in use at the different patent authorities. As part of any patent application process, the examiners at the patent offices assign up to several hundred IPC/CPC codes to every filed patent to describe its content [3]. This independent assessment of the content, together with the availability and language independence of the codes, makes both taxonomies extremely valuable for classification tasks. While the first three levels of CPC and IPC (section, class, subclass) show a high degree of similarity, the greater level of detail provided by CPC is particularly evident from the group level and below [3]. Despite this new wealth of information, the CPC system has so far received only limited scientific attention. Since 2013, about four times more publications listed in Web of Science deal with IPC than with CPC.

Automated Text Categorization using Patents
Classification algorithms are successfully applied to patent texts. In early approaches full texts serve as basis for automated categorization using k-Nearest-Neighbor (kNN), naïve Bayes, Support Vector Machine (SVM) or back-propagation neural network [10,11]. In order to improve the classification performance, which is reduced, inter alia, by the legal and blurred language of patents, further data from distinct sources were combined into increasingly complex models. For example, Liu and Shih [7] suggest a hybrid classification with a weighted linear combination of content-, citation-, metadata-, and network-based predictions, which outperformed each single contribution with F1 = 86.4%. However, this performance is contrasted by an enormous complexity and effort: kNN and SVM, cosine similarity, ontology-based network analysis, more than 27,000 patents for training purposes, and the adjustment of weights for the final prediction. All this illustrates the methodological challenges and the workload, which have to be overcome.
In contrast, with only 1600 training objects a multiclass setting can successfully be established using SVM (66.4% accuracy) leading the classifier ranking followed by random forest and kNN [12]. Furthermore, SVM served as both base and fusion classifier in an ensemble approach and achieved F1 = 77.7%, further increased by implementing user feedback as active learning (F1 = 84.2%) [13]. In order to achieve these high performances in the chosen multiclass classification, Zhang interactively extended his model not only to active learning, but also by reducing the text features using principal component analysis or reinforcing the training process through so-called dynamic certainty propagation.
IPC or CPC have rarely been used as a knowledge source, but rather as a target for automated document categorization to relieve the examiners from manual work [7,11,14]. Recent deep learning approaches such as BERT have also proven their predictive power in mastering the assignment of patents at the CPC subclass level [15]. In general, this growing number of approaches with extensive pre-trained language models has great potential for semantic tasks such as text classification. However, their high complexity (e.g., in terms of fine-tuning) and their demand for huge amounts of training data limit their applicability to patent classification, especially into user-defined categories with usually very limited numbers of training samples. Therefore, our approach focuses on applicability with established methods as a solid basis for a proof of concept. This could serve as a basis for future studies to evaluate whether more complex methods such as BERT are worthwhile in terms of effort and outcome.

Ensemble Classification
There is an extensive body of research and general knowledge on combining classifiers [16][17][18] to achieve, e.g., an improvement of the overall performance [8,19] despite noisy training data. The diversity provided by the base classifiers in particular bolsters the capability of their fusion [20]. Among the different options to introduce this diversity, e.g., varying the base ML classifier, the use of distinct features extracted from the same data is considered to be the most promising approach [8].
The final performance depends on the scheme used to combine the base predictions. Both methods, fixed combination rules (FCR), i.e., combining the estimates of posterior class probability using algebraic operations such as summation or product [21], and stacking, i.e., using a classifier to fuse the probability estimates of the base learners [22], are successfully deployed in text classification. Surprisingly, patents are very rarely the subject of classifier combinations, even though, unlike scientific texts, patents are considered particularly difficult to classify. This is true even when considering boosting, a technique in which a larger number of weak learners are combined to achieve a strong ensemble performance. During the last 25 years, adaptive boosting (AdaBoost) [23] in particular, which assigns increased weights during training to the base learners with higher predictive power, has been studied intensively. AdaBoost works especially well on weak learners performing only slightly better than random classification, whereas using stronger base classifiers such as SVM does not lead to improved results [24]. More recently, Lee and colleagues [37] used a combination of topic modeling and AdaBoost to predict the suitability of patents for further technology transfer and achieved a maximum F1 value of approx. 0.589. Apart from patent classification, AdaBoost with SVM is currently applied successfully to issues such as imbalanced streaming classification and concept drift [25,26].

Feature Extraction from Graphs
The IPC and, even more so, the CPC taxonomy constitute a powerful representation of information regarding the content of the underlying patents, structured as a graph of nodes and edges. A plethora of research work has been performed on studying graph-based pattern representation [27]. To utilize patent taxonomies as input for machine learning classification, the graph representation (G) must be transformed into a real-valued feature domain.
Besides extracting structural information of graphs with, e.g., the adjacency matrix or the Laplacian matrix methods [28], the widely used Graph Edit Distance (GED) approach compares two graphs by computing the distance as a cost function, i.e., as the cost of a minimal set of edit operations such as inserting, deleting, and substituting nodes or edges to transform one graph into the other. In their pioneering work on Vector Space Embedding, Riesen and Bunke [9] introduced the concept of a set of selected prototypes, i.e., training graphs to which the GED of an input graph is computed. The resulting dissimilarity representation characterizes the specificity of the input graph as a dense real-valued vector, which is well suited as input for the subsequent classification algorithms. Additionally, different graph edit approximations have been successfully deployed even in an ensemble setting, outperforming the best individual base classifier [29].

Materials and Methods
A classifier $C$ solving a supervised learning problem for a given set of $n$ classes $\Omega = \{\omega_1, \ldots, \omega_n\}$ is described by a set of functions $\{c_1, \ldots, c_n\}$ defined as

$$c_i : X \to \mathbb{R}_{\geq 0}, \quad i = 1, \ldots, n.$$

Here $c_i$ assigns every object from the input space $X$ a non-negative score for each class $\omega_i$. Additionally, we expect the sum of all class scores to be equal to 1, i.e., $\sum_{i=1}^{n} c_i(x) = 1$ for every $x \in X$. Thus, the used classifiers are called probabilistic because their score outputs can be interpreted as the likelihood of an object $x \in X$ belonging to the class $\omega_i$. To combine the strengths of different approaches to automatically assign patents to biomedical classes of future innovation, a unified ensemble classification system is created as follows (see Figure 1):

Figure 1. General structure of a classification model using different features V of representations R of input data for an ensemble of n base classifiers C merged by a fusion method F.
In our approach, at the input site two different representations of each patent document, namely text (title, abstract) and assigned taxonomies (CPC and IPC), are transformed into numerical features to be fed to machine learning base classifiers. After hyperparameter tuning, four different base classifiers compute their predictions from distinct features of the same object (Stage I) to be finally merged by various fusion methods providing the final prediction (Stage II). The whole two-stage-process (see Figure 2) has been implemented using Python and the scikit-learn library [30].

Patent Data
The training data consists of 1230 distinct patents, which are equally assigned to six classes of innovative biomedical devices (see Table 1). This dataset is based on our preliminary work [31]: the data were selected and labeled with a study-based keyword search and further optimized by experts. To evaluate the classification performance, an externally validated test data set is used comprising additional 94 patents, which were categorized by biomedical experts. All content of the patents, the text of titles and abstracts as well as the de-duplicated IPC and CPC codes were extracted from the patent database PATSTAT [32] and then transferred to feature transformation.

To obtain numerical feature vectors from textual data, the widely used 'bag-of-words' model is applied after preprocessing the concatenated title and abstract by removing stop-words not providing any additional information. Final feature vectors for text categorization are created utilizing a TF-IDF value computation for each input term per text document [6].
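
The bag-of-words and TF-IDF steps described above can be illustrated with a minimal sketch. It is not the pipeline used in this study (which relies on scikit-learn); the tiny stop-word list, the whitespace tokenizer, the weighting variant tf · log(N/df), and the function names `tokenize` and `tfidf` are all assumptions made only for illustration.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def tfidf(documents):
    """Return the sorted vocabulary and one TF-IDF vector per document.

    Uses the plain variant tf * log(N / df); libraries such as scikit-learn
    apply smoothing and normalization on top of this (assumption: the exact
    variant used in the study is not stated in the text).
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


# Hypothetical concatenated title/abstract snippets.
docs = [
    "Implant for the knee joint",
    "Imaging device for the knee",
    "Telemedicine system with imaging device",
]
vocab, vectors = tfidf(docs)
```

Terms that occur in only one document (such as "implant") receive the highest weight, while terms shared by several documents are down-weighted.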

Taxonomic Data
By implementing the following steps, the graph based taxonomic data of IPC and CPC are transformed into a real-valued vector to be used as input for machine learning algorithms.

Tree Creation
Since both used patent classifications comprise a hierarchically organized taxonomy, each can be represented by a tree. By additionally inserting a root node, the trees of all assigned class codes are unified in one tree structured graph (see Figure 3), which facilitates the computing of the distance between the structured codes of CPC/IPC taxonomy using the tree-edit-distance [33].

The string representations of the IPC/CPC codes have to be parsed to build the required tree. However, a full deduction of the hierarchy as given in the official CPC definition [34] solely from the code strings is not possible, because the different subgroup levels are not represented in the code strings. Thus, the tree parsed from the code string is only an approximation of the actually defined structure, created by discarding any intermediate subgroup levels that may exist between the given code and the main group. For example, a patent is marked with code A61B5/0055. Since, according to the official CPC definition, this is a 2nd-level subgroup (see Table 2), the string parser discards one intermediate subgroup. Nevertheless, the parsed tree carries as much information as can be extracted solely from the string representation of a CPC/IPC code.
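
The parsing step can be sketched as follows. This is a minimal illustration, not the parser used in this study: the regular expression, the node labels, the nested-dictionary tree, and the names `code_to_path` and `build_tree` are assumptions. The sketch attaches each subgroup directly below its main group, which mirrors the approximation described above.

```python
import re

# Section letter, two-digit class, subclass letter, main group, subgroup.
# Assumption: codes look like "A61B5/0055", optionally with padding spaces.
CODE_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def code_to_path(code):
    """Split a CPC/IPC code into the node labels from section to subgroup."""
    match = CODE_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognized classification code: {code!r}")
    section, cls, subclass, group, subgroup = match.groups()
    path = [section, section + cls, section + cls + subclass]
    path.append(f"{path[2]}{group}/00")  # main group
    if subgroup != "00":
        # Intermediate subgroup levels cannot be recovered from the string,
        # so the subgroup is attached directly below the main group.
        path.append(f"{path[2]}{group}/{subgroup}")
    return path


def build_tree(codes):
    """Merge all codes of one patent into one tree; the outer dict is the root."""
    root = {}
    for code in codes:
        node = root
        for label in code_to_path(code):
            node = node.setdefault(label, {})
    return root


path = code_to_path("A61B5/0055")
tree = build_tree(["A61B5/0055", "A61B5/00", "G16H40/67"])
```

Here `path` lists the section, class, subclass, main group, and subgroup of the example code, and `tree` unifies three codes under a common root.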

Vector Space Embedding
We adopt the Dissimilarity Space Embedding method (DSE) to transform the CPC and IPC tree into a vector space following the work of Riesen and Bunke [9]. A graph object can therefore be described by its dissimilarity to a fixed set of other objects from the same domain (prototypes).
Given a graph domain $G$, a dissimilarity between graphs can be expressed using a distance function $d : G \times G \to \mathbb{R}_{\geq 0}$. Given such a distance function $d$ and a set of prototypes $g_1, \ldots, g_n \in G$, a dissimilarity space embedding $\gamma$ can be defined as:

$$\gamma : G \to \mathbb{R}^n, \quad \gamma(g) = \big(d(g, g_1), \ldots, d(g, g_n)\big).$$

A suitable distance function in a dissimilarity space embedding is provided by the graph edit distance (GED), given by the minimal-cost sequence of operations transforming one graph into another using insertion, deletion, and relabeling of nodes and edges, with the cost of each operation set to 1. For computing the general GED no efficient algorithm is known [35]. However, since IPC/CPC can be represented as ordered trees, the problem can be reduced to computing the tree edit distance utilizing the Zhang-Shasha algorithm [36]. Although this is an efficient algorithm, no implementation of sufficient performance was available. Thus, we implemented the algorithm in Rust (GitHub repository 'Tree edit distance algorithm implemented in rust': https://github.com/AME-SCM/tree-edit-distance).
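
A minimal sketch of the embedding itself is given below. The function names are hypothetical, and `node_set_distance` is only a simple stand-in for the tree edit distance used in this study (it counts the nodes present in exactly one of two trees, with each tree reduced to its set of node labels); any distance function with the same signature could be plugged in.

```python
def dissimilarity_embedding(graph, prototypes, distance):
    """Map a graph to the vector of its distances to the fixed prototypes."""
    return [distance(graph, prototype) for prototype in prototypes]


def node_set_distance(a, b):
    """Stand-in distance (NOT the tree edit distance): size of the symmetric
    difference of two node-label sets."""
    return len(a ^ b)


# Hypothetical prototypes and input, each given as a set of node labels.
prototypes = [
    frozenset({"A", "A61", "A61B"}),
    frozenset({"G", "G16", "G16H"}),
]
patent = frozenset({"A", "A61", "A61F"})

vector = dissimilarity_embedding(patent, prototypes, node_set_distance)
```

The resulting vector has one entry per prototype, so its length is fixed by the number of prototypes and does not depend on the size of the input tree.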

Prototype Selection Methods
Before any features are created by invoking the function γ, a set of graphs g 1 , . . . , g n called 'prototypes' needs to be fixed. Prototypes are chosen in advance from the training data as the source of known data and are not altered thereafter. Two selection methods with different capabilities [9] were applied to optimize the resulting real-valued feature vectors: (1) the random prototype selector, which selects prototypes in a completely random manner, and (2) the spanning prototype selector, which selects the median graph of the training data as the first prototype. Each further prototype is the graph that maximizes the distance to the nearest of the already chosen prototypes, yielding a better representation of the graph domain. Hyperparameter tuning determines the number of prototypes needed to deliver meaningful features.
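
The two selection strategies can be sketched as follows, assuming a generic distance function. The function names and the toy example (integers with the absolute difference as distance) are hypothetical and serve only to make the selection order easy to follow; in the study the objects are taxonomy trees and the distance is the tree edit distance.

```python
import random


def random_selector(graphs, n, seed=0):
    """Select n prototypes completely at random (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(graphs, n)


def spanning_selector(graphs, n, distance):
    """Start with the median graph, then repeatedly add the graph whose
    nearest already chosen prototype is farthest away."""
    indices = list(range(len(graphs)))
    # Median graph: smallest summed distance to all training graphs.
    median = min(indices, key=lambda i: sum(distance(graphs[i], g) for g in graphs))
    chosen = [median]
    remaining = [i for i in indices if i != median]
    while len(chosen) < n and remaining:
        farthest = max(
            remaining,
            key=lambda i: min(distance(graphs[i], graphs[j]) for j in chosen),
        )
        chosen.append(farthest)
        remaining.remove(farthest)
    return [graphs[i] for i in chosen]


# Toy stand-in for graphs: integers with the absolute difference as distance.
points = [0, 1, 2, 3, 10]
spanning = spanning_selector(points, 3, lambda a, b: abs(a - b))
randomly_chosen = random_selector(points, 2)
```

In the toy example the median element 2 is chosen first, followed by the outlier 10 and then 0, so the prototypes spread over the whole range of the data.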

Classifier Selection and Hyperparameter Tuning
A set of four well-established machine learning algorithms is selected, particularly to enable comparisons to prior work: Support Vector Machine (SVM), k-Nearest Neighbor (kNN), Logistic Regression (LogReg), and feed-forward Artificial Neural Network (ANN). They all serve as both base and fusion classifiers (see Figure 2).
To obtain performance estimates of each combination of classifier and input feature, a 10-fold cross-validation on the training data set was carried out, using grid or random search to optimize all hyperparameters (see Table 3; for details see Appendix A Table A1).

Table 3. Grid of hyperparameters to optimize the performance of the base classifiers acting on different textual and taxonomic features in 10-fold cross-validation.

For hyperparameter tuning and all further evaluations conducted on the externally validated test data set, we employ the overall micro-averaged F1 score [7], representing a balanced ratio of recall and precision. Ranking the classification performances (F1) in training and testing provided an ordered list according to the summed-up ranks. The top- and bottom-ranked classifiers from this list were selected for later combination, following the notion that in some cases even less well-performing base learners might contribute substantially to the overall ensemble outcome [8].
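
As a worked illustration of the evaluation measure, the micro-averaged F1 score can be computed by pooling true positives, false positives, and false negatives over all classes. The sketch below is a plain re-implementation for illustration, not the evaluation code of this study; the function name and the labels are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes first."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, prediction in zip(y_true, y_pred):
            if prediction == label and truth == label:
                tp += 1
            elif prediction == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five patents.
y_true = ["implant", "imaging", "telemedicine", "imaging", "implant"]
y_pred = ["implant", "imaging", "imaging", "imaging", "telemedicine"]
score = micro_f1(y_true, y_pred)
```

With three of five patents classified correctly, the pooled precision and recall are both 0.6, so the micro-averaged F1 is 0.6. In a single-label multiclass setting every misclassification counts once as a false positive and once as a false negative, which is why micro-averaged F1 coincides with accuracy there.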

Fusion Methods
A combination of classifiers $C_1, \ldots, C_m$ can be defined as:

$$C(x) = F\big(C_1(x), \ldots, C_m(x)\big).$$

Accordingly, the fusion method $F$ systematically combines the predictions of the used base classifiers $C_1, \ldots, C_m$ processing an input object $x$. Two approaches to obtain a fusion method $F$ are used.
Fixed combination rules combine the class prediction output of base classifiers in a predefined way by utilizing simple arithmetic operations (sum, product) or basic set operations (minimum, maximum). Considering the classification output for input object $x$ and class $\omega_i$ of base classifier $C_j$ as $c_{j,i}(x)$, the sum rule is given by:

$$c_i(x) = \frac{1}{m} \sum_{j=1}^{m} c_{j,i}(x),$$

where the factor $1/m$ keeps the combined class scores summing to 1. The product, minimum, and maximum fixed combination rules are constructed likewise [19].
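
The four fixed combination rules can be sketched in a few lines. The function name `combine`, the renormalization of the combined scores to sum to 1, and the example scores are assumptions for illustration only.

```python
from math import prod

# Each rule reduces the scores that the base classifiers give to one class.
RULES = {"sum": sum, "product": prod, "min": min, "max": max}


def combine(predictions, rule):
    """Fuse per-classifier class scores with a fixed combination rule.

    predictions: one list of class scores per base classifier.
    Returns combined scores, renormalized to sum to 1 (an assumption made
    here so that the output is again a probabilistic prediction).
    """
    operation = RULES[rule]
    n_classes = len(predictions[0])
    raw = [operation([p[i] for p in predictions]) for i in range(n_classes)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw


# Hypothetical outputs of a text-based and a CPC-based learner (3 classes).
text_scores = [0.5, 0.4, 0.1]
cpc_scores = [0.2, 0.6, 0.2]

fused = {rule: combine([text_scores, cpc_scores], rule) for rule in RULES}
predicted = {rule: scores.index(max(scores)) for rule, scores in fused.items()}
```

In this example the text-based learner alone would predict the first class, whereas all four rules select the second class once the taxonomic scores are taken into account.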
Stacking interprets the combination of the base classifiers' output as a further classification problem, creating a meta-classifier. The base classifiers' output for objects with known class assignment is fed into a classifier for training. In comparison to the fixed combination rules, this requires an additional training phase including hyperparameter tuning (for details see Appendix A Table A2) for the used meta-classifier, while providing more flexibility to adapt to the learning data.

Experimental Design
The final evaluation consists of a variety of differently composed ensembles, which always consist of two base learners each. These ensembles differ in three factors, namely the feature used, the performance of the base learners, and the applied fusion method (see Table 4). Overall, 64 different ensembles were tested and the corresponding F1 score was computed.
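
One way to arrive at the stated number of 64 ensembles is to enumerate the factors explicitly. The factorization below (two taxonomies, top or bottom rank for the text-based and for the taxonomy-based learner, and eight fusion methods) is our reading of the description and therefore an assumption; all variable names are hypothetical.

```python
from itertools import product

taxonomies = ["IPC", "CPC"]            # taxonomic feature paired with text
ranks = ["top", "bottom"]              # performance rank of a base learner
stackers = ["SVM", "kNN", "LogReg", "ANN"]
rules = ["sum", "product", "min", "max"]

# Eight fusion methods: four stacking classifiers and four fixed rules.
fusion_methods = [("stacking", s) for s in stackers] + [
    ("fixed rule", r) for r in rules
]

ensembles = [
    {
        "taxonomy": taxonomy,
        "text_rank": text_rank,
        "taxonomy_rank": taxonomy_rank,
        "fusion": fusion,
    }
    for taxonomy, text_rank, taxonomy_rank, fusion in product(
        taxonomies, ranks, ranks, fusion_methods
    )
]
```

Under this reading, 2 × 2 × 2 × 8 = 64 combinations result, which matches the number of ensembles reported.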

Basic Evaluation
The initial basic evaluation on the test data highlights the mutual dependency between the classification algorithm and the type of feature (see Figure 4). SVM and kNN perform best on text (69.1%), while ANN on CPC reaches nearly the same top score (68.1%), as does Logistic Regression (LogReg) on the IPC taxonomy. In contrast, the weaker base classifiers only achieve F1 values of 53.2% (CPC-kNN), 54.3% (IPC-kNN), and 60.6% (LogReg on text).
Considering the results of prior research [12], it was to be expected that SVM would again prove to be a powerful tool for text classification in the present task. The almost equal quality of classification achieved with both patent taxonomies using ANN or LogReg is all the more remarkable, because it is only about one percentage point below the best value of SVM on text. Different from the majority of past approaches that use the taxonomies as target of automated categorization [11,14], our results place both IPC and CPC close to the text of patent documents to be used as a valuable source of information suitable for machine learning approaches in multiclass environments.
Since medical technology is an international domain, its technology management is forced to analyze patents worldwide originally written in many different languages. However, in many countries, such as Germany, patents are not translated into English by default. Thus, only 51.6% of all records in the international patent database provided by the European Patent Office (PATSTAT 2019 autumn version [32]) contain an English title and abstract. This is a drawback for international technology management, as automated patent categorization based on textual features usually depends on the common language of the data.

In contrast, CPC or IPC codes are language independent and very frequently assigned to the patent documents. In the PATSTAT database, 85.2% of documents are assigned IPC or CPC codes. Therefore, the IPC/CPC taxonomies could potentially bridge the language gap by even replacing text features in machine learning classification systems for international patents. We have explored this prospect in a preliminary analysis, the results of which are reported in Section 4.4, Outlook.

Ensemble Evaluation
The fusion of base classifiers yields in most cases (60/64) an improvement of classification performance on the test data compared to the best base classifier of each ensemble (see Table 5).

Table 5. Resulting performance (F1) of the ensemble classification displayed as pivot table of different base classifier combinations using stacking or fixed combination rules as fusion method. The combination of the base learners in pairs of one text and one taxonomic (either IPC or CPC) classifier is additionally varied according to the Base Performance Ranking (BPR: top or bottom rank). For each of those conditions, four different stacking classifiers as well as four fixed combination rules are applied (see Table 4). Peak values are printed in bold.

The improvements reach a maximum of 9.6 percentage points, which in two cases leads to the peak value of the entire performance matrix of F1 = 78.7%, namely by combining SVM (Text) with ANN (IPC) or ANN (CPC). At this stage, two first conclusions can be drawn: (1) in our approach, ensemble classification using text and taxonomic features in general proves to be a very effective method to enhance the overall classification performance; (2) our achieved maximum value certainly holds up in comparison to multi-classifier fusion with F1 = 77.6% [13] before adding active learning components, or to network-based classification with F1 = 76.2% [7]. The latter approach only reaches F1 = 86.4% by hybridization of four different patent classification approaches. The impressive performance is counterbalanced by a high complexity in using a wide variety of methods and distinct data sources. In contrast, for our model the IPC/CPC codes and titles/abstracts of the patents are readily available from the same structured database [32]. Except for DSE, the deployed feature transformations, the base classifiers, and the fusion methods are all well established, which further strengthens the applicability and efficiency of the presented pipeline.
Finally, the fusion by the fixed combination rules sum and product achieved remarkable top scores of F1 = 77.7%, only one percentage point behind the overall peak using stacking, without the burden of training and hyperparameter tuning.
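To make the fixed combination rules concrete, the following minimal Python sketch fuses the class-probability outputs of two base classifiers with the sum rule and the product rule. The function name `fuse` and all probability values are illustrative assumptions and are not taken from the reported experiments; the sketch also assumes that every base classifier outputs one posterior probability per class.

```python
from math import prod


def fuse(prob_vectors, rule="sum"):
    """Fuse per-classifier class-probability vectors with a fixed rule.

    prob_vectors: list of equally long lists, one per base classifier.
    Returns the index of the winning class.
    """
    n_classes = len(prob_vectors[0])
    if any(len(p) != n_classes for p in prob_vectors):
        raise ValueError("all classifiers must score the same classes")
    if rule == "sum":
        scores = [sum(p[c] for p in prob_vectors) for c in range(n_classes)]
    elif rule == "product":
        scores = [prod(p[c] for p in prob_vectors) for c in range(n_classes)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    # No training step: the decision is the class with the highest fused score.
    return max(range(n_classes), key=scores.__getitem__)


# Made-up posteriors for one patent over three classes (illustration only).
text_probs = [0.70, 0.30, 0.00]  # hypothetical text-based classifier
cpc_probs = [0.05, 0.35, 0.60]   # hypothetical CPC-based classifier

print(fuse([text_probs, cpc_probs], "sum"))
print(fuse([text_probs, cpc_probs], "product"))
```

In this made-up example the two rules disagree: the sum rule selects class 0, whereas the product rule selects class 1, because a near-zero probability from a single classifier acts as a veto under the product rule. Neither rule involves any training or hyperparameter tuning.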

Boosting
To analyze the impact of the performance rank of base learners on the final ensemble outcome, the F1 results are averaged over this basic condition. They show a clear trend (see Table 6): the best performances are achieved by combining two top-ranked learners, for both fusion methods, stacking and fixed combination rules. Consistently, combining only bottom-ranked classifiers places last in the ranking of the final outcome. Hence, we conclude that in our case comparably weak base classifiers are not capable of strengthening the ensemble categorization. This somewhat contradicts the fundamental notion underlying the well-established boosting approaches, which have recently been applied very successfully, e.g., in stream data classification and concept shift problems. However, the specific characteristic of boosting is that 'weak learners' are defined as performing only slightly better than random categorization. For example, Lee and colleagues [37], applying boosting to predict the transfer potential of patents, finally reach a top F1 score of 0.589. In our approach, even the bottom-ranked base learners perform at about the same level (F1 = 0.532) and can consequently not be seen as truly 'weak' classifiers. In general, it has been proven that AdaBoost is not successful with strongly performing base learners [38].

Outlook
As discussed with the basic evaluation above, the best taxonomic base learners perform nearly as well as the still leading text-based classifier SVM (see Table 5). This raises the question of whether a combination of just strong IPC- and CPC-based classifiers, omitting all text features, might boost the overall performance, perhaps even beyond the scores of the best text-based learners. Thus, a corresponding preliminary experiment has been conducted: the base learners LogReg (IPC) and ANN (CPC) are combined by fixed combination rules (FCR) as well as by stacking (see Figure 5). Whereas the fusion with FCR provides no improvement, stacking with both ANN and SVM raises the overall performance to an F1 score of up to 71.2% on the test data. This outperforms the best basic performance so far, that of a text-based SVM, and is completely independent of the language of the written patent information. Considering the PATSTAT patent database, more than 22 million additional documents lacking English titles/abstracts could now be included in a technology analysis based on automated categorization. The potential seems very large, but further studies first have to evaluate the scalability of our preliminary results to further domains and larger sample sizes.
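In contrast to fixed rules, stacking trains a meta-classifier on the outputs of the base learners. The following Python sketch illustrates this idea for two taxonomic base learners. It is a simplified illustration under stated assumptions: the nearest-centroid meta-learner is a deliberately simple stand-in (the stacking classifiers of the experiment, such as ANN or SVM, are not reproduced here), the names `meta_features` and `NearestCentroidMeta` are hypothetical, and all probabilities are made up. In practice, the meta-learner would be fitted on base-learner outputs for patents that the base learners did not see during their own training.

```python
def meta_features(probs_a, probs_b):
    """Concatenate the class-probability vectors of two base learners."""
    return list(probs_a) + list(probs_b)


class NearestCentroidMeta:
    """Toy meta-classifier: assigns the class with the closest mean vector."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {
            label: [v / counts[label] for v in acc]
            for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def sq_dist(u, v):
            return sum((a - b) ** 2 for a, b in zip(u, v))

        return [
            min(self.centroids_, key=lambda lab: sq_dist(row, self.centroids_[lab]))
            for row in X
        ]


# Made-up base-learner outputs for four training patents and two classes.
ipc_train = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]  # hypothetical IPC learner
cpc_train = [[0.6, 0.4], [0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]  # hypothetical CPC learner
y_train = [0, 0, 1, 1]

X_meta = [meta_features(a, b) for a, b in zip(ipc_train, cpc_train)]
meta = NearestCentroidMeta().fit(X_meta, y_train)

# Fuse the base-learner outputs for one new patent.
print(meta.predict([meta_features([0.45, 0.55], [0.35, 0.65])]))
```

Because the meta-learner only sees probability vectors derived from IPC and CPC codes, the fused decision in this sketch never depends on the language of the patent text.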
Furthermore, modern deep learning approaches such as BERT or ELMo [39,40], with extensive pre-trained language models including multilingual and domain-specific variants (e.g., BioBERT [41]), show great potential in semantic tasks and might also boost our patent classification on cross-lingual patent data. However, it remains to be clarified whether the high effort of such complex approaches is also worthwhile in terms of applicability and performance. The results and data of our work might serve as a basis for an elaborate evaluation as part of future work.

Limitations
In this paper, to the best of our knowledge for the first time, IPC and CPC taxonomies are applied as graph-based information resources for automated patent classification. Despite this success, some limitations remain. Although we conducted extensive hyperparameter tuning, the cost of 1 for each operation in the tree edit distance calculation remained untuned. Altering the costs in preliminary experiments changed the overall performance, so this should be investigated in future work. Likewise, parsing the corresponding tree solely from the string representation of the CPC/IPC codes does not extract all the available information. Thus, with a method to represent the full subgroup relationship, more information could be incorporated into the CPC code tree.
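The role of the edit costs and of the string-based parsing can be illustrated with a short Python sketch. It is a simplified stand-in under stated assumptions, not the implementation used in this work: each code is parsed from its string into a single root-to-leaf path (section, class, subclass, main group, subgroup), the tree edit distance is replaced by a sequence edit distance on that path, and the function names, the prototype set, and the example codes are chosen for illustration only. The cost parameters default to 1, mirroring the untuned setting described above.

```python
import re


def cpc_path(code):
    """Split a symbol such as 'A61B 5/0205' into a root-to-leaf label path."""
    m = re.fullmatch(r"([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,})", code.strip())
    if m is None:
        raise ValueError(f"not a full CPC/IPC symbol: {code!r}")
    section, cls, subclass, group, subgroup = m.groups()
    stem = section + cls + subclass
    # Cumulative labels keep nodes under different parents distinct.
    return [section, section + cls, stem, stem + group, f"{stem}{group}/{subgroup}"]


def path_edit_distance(a, b, c_ins=1.0, c_del=1.0, c_sub=1.0):
    """Edit distance between two label paths with configurable operation costs."""
    prev = [j * c_ins for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * c_del]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + c_del,                          # delete x
                cur[j - 1] + c_ins,                       # insert y
                prev[j - 1] + (0.0 if x == y else c_sub), # keep or relabel
            ))
        prev = cur
    return prev[-1]


def dse_embed(code, prototypes, **costs):
    """Dissimilarity space embedding: one distance per prototype code."""
    path = cpc_path(code)
    return [path_edit_distance(path, cpc_path(p), **costs) for p in prototypes]


# Hypothetical prototype set; in practice it would be selected from training data.
prototypes = ["A61B 5/0205", "G06F 17/30", "A61K 31/00"]
print(dse_embed("A61B 5/024", prototypes))
print(dse_embed("A61B 5/024", prototypes, c_sub=2.0))
```

With unit costs, the example code is embedded as [1.0, 5.0, 3.0], i.e., it is closest to the prototype that shares its main group. Changing `c_sub` changes the resulting vector, which is one way to see why the operation costs may influence downstream classification performance. Because the path is derived from the string alone, subgroup relationships that are not visible in the symbol are not represented, which corresponds to the second limitation named above.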

Conclusions
We were able to show that the official patent taxonomies, IPC and CPC, contribute substantially to the performance of automated patent categorization into user-defined classes when the graph is transformed into a real-valued vector space by Dissimilarity Space Embedding. The multiclass classification with taxonomic base classifiers achieves F1 values close to the top text classifier. The fusion of the best performing taxonomic and textual base classifiers increases the overall performance by 9.6 percentage points to a final F1 score of 78.7%.
This unlocks the potential of hierarchical patent taxonomies as a valuable part of an ensemble to enhance automated patent categorization. In addition, IPC and CPC are language independent, which makes the difference: the purely taxonomic ensemble with stacking performs even better than the best text-based learners. This opens access to millions of additional patent documents, enabling new possibilities for the management of biomedical technologies.
Overall, the deployed methods are well established in research and all data used are accessible from one integrated source. Even the novel implementation of the taxonomic CPC and IPC graph is built upon a well-described procedure (DSE) to transform the information into a real-valued vector. Consequently, this increases the applicability of the whole approach.
Our novel approach could also make an important contribution to the digitization of the health care system. For the annotation of health data, medical category systems such as the International Classification of Diseases or SNOMED CT are used. Accordingly, this graph-based information could now contribute to future solutions using the presented approach, thus advancing AI implementations in decision support for diagnosis and therapy.
Author Contributions: Design of the approach, software, investigation and draft manuscript preparation, K.F.; data curation, software testing, formal analysis, M.B.; visualization, S.G.; validation, resources, supervision, R.F. All authors have cooperated in conceptualization as well as in editing and revising the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by Klaus Tschira Foundation gGmbH, Heidelberg, Germany.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: Restrictions apply to the availability of these data. Data were obtained from European Patent Office (EPO) and are available at https://www.epo.org/searching-for-patents/business/patstat.html with the permission of EPO.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A

Table A1. Results of hyperparameter tuning of the base classifiers: feed-forward Artificial Neural Network (ANN), k-Nearest-Neighbor (kNN), Logistic Regression (LogReg), and Support Vector Machine (SVM). The F1 scores for testing and 10-fold cross-validation (CV) were ranked (values in parentheses: Base Performance Ranking, BPR) for each feature. The summed BPR are used to select top (**) and bottom (*) ranked classifiers to build ensemble pairs for final evaluation.