Article

An Italian Patent Multi-Label Classification System to Support the Innovation Demand and Supply Matching

by Nicola Amoroso 1,2,†, Annamaria Demarinis Loiotile 3,4,*,†, Ester Pantaleo 4, Giuseppe Conti 5,6, Shiva Loccisano 5, Sabina Tangaro 2,7, Alfonso Monaco 2,4 and Roberto Bellotti 2,4

1 Department of Pharmacy-Pharmaceutical Sciences, University of Bari Aldo Moro, 70125 Bari, Italy
2 Italian Institute of Nuclear Physics, Bari Section, 70125 Bari, Italy
3 Department of Electrical and Information Engineering, Polytechnic University of Bari, 70125 Bari, Italy
4 Interuniversity Department of Physics, University of Bari Aldo Moro, 70125 Bari, Italy
5 Netval - Network for Research Valorisation, 23900 Lecco, Italy
6 Head Office, University School for Advanced Studies IUSS, 27100 Pavia, Italy
7 Department of Soil, Plant and Food Sciences, University of Bari Aldo Moro, 70125 Bari, Italy
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Sustainability 2025, 17(14), 6425; https://doi.org/10.3390/su17146425
Submission received: 4 May 2025 / Revised: 21 June 2025 / Accepted: 24 June 2025 / Published: 14 July 2025

Abstract

Matching innovation demand and supply requires an accurate and time-consuming analysis of patents and the identification of their technological domains. Because these tasks can be particularly challenging, recent studies have evaluated the adoption of Artificial Intelligence based on Natural Language Processing (NLP) techniques. Here, we present an automated workflow for patent analysis and classification devoted to the Italian patent scenario. High-quality data from the online platform KnowledgeShare (KS) were investigated: KS is the first patent management platform on the Italian innovation scene. An equally important aspect was determining which words most influenced patent classification, thus characterizing the corresponding research areas. Several models were compared to ensure the workflow’s robustness; Logistic Regression (LR) was the best-performing model, and its performance compared well with the state of the art. For each technological domain in the KS database, we evaluated and discussed its characteristic words; a further analysis focused on explaining why some domains, such as “Packaging” and “Environment,” were particularly confounding. This last aspect is of paramount importance to identify cross-contamination effects among research areas.

1. Introduction

Knowledge valorization is a priority in the European Research Area (ERA); the value creation and societal and economic impacts are part of the common set of values and principles for R&I in the European Union [1]. In modern knowledge economies, intellectual property assets are both drivers of development and factors of social transition. At the European level, industries that make intensive use of intellectual property rights (IPRs) generate 45 percent of annual GDP (EUR 6.6 trillion) and account for 63 million jobs (29 percent of all jobs) [2]. The development of knowledge-intensive and innovative companies is closely linked to intellectual property (IP) because most of a company’s net worth is contained in these intangible assets [3]; for them, the patent study can be a useful proxy for innovation and technological change and diffusion [4]. Since between 70% and 90% of the information about technologies is not published anywhere except in patent documents [5,6], patents are among the best sources of information [7] for the following:
  • Technological mapping and forecasting [8];
  • Predicting core and emerging technologies [9,10,11,12];
  • Diffusion of technologies [8];
  • Convergence of technologies [13];
  • Identification of technological vacuums and hotspots [14];
  • Portfolio analysis [15];
  • Competitive analysis [16,17];
  • Technology trend analysis [18,19,20];
  • Avoiding infringement [21];
  • Identifying technological competitors [14].
Patent documents contain a large quantity of metadata, and, at a time when data production, processing, and management have become key factors in the business of the Industry 4.0 economy, the analysis of these data can serve as a reflection of industrial evolution [22] and economic development. Data presents value for enabling a competitive data-driven economy, which is the core of the Internet of Things and Industry 4.0 [23,24]. The availability of big data represents a great chance to improve decision-making and strategic development and to introduce the next generation of innovative and disruptive technologies [17,25].
Thus, promoting the effective use and development of intellectual property and ensuring easier access to and sharing of IP-protected assets in times of crisis are key priorities of the Union’s IP Action Plan [26]. Universities are a mine of innovation and play an important role in their development; in fact, they are recognized as a critical element in the global competitiveness of enterprises [27,28,29]. Patent transfers between universities and companies have garnered significant attention recently. Previous studies demonstrated that the patent transfer has significant impacts not only on both academia and industry [30] but also on national economies [31,32,33]. To name a few studies, McDevitt et al. [34] showed that patent transfer provides public benefits and influences economic development in addition to generating revenue, increasing funding opportunities, and promoting a culture of entrepreneurship and innovation for universities; Roessner et al. [35] combined U.S. university licensing data from 1996 to 2010 with input-output economic models and found that the contribution of university licensing to the U.S. economy during the period was at least USD 162.1 billion [30,31].
However, technologies patented by universities and research centers, notoriously, at least at the European level, are poorly exploited; in fact, as reported in the ASTP Survey Report on KT Activities FY2019, only 18% of inventions are licensed or optioned [36]. In order to address this lack of valorization and help universities and research centers promote their research results, specific initiatives such as online patent platforms have emerged over time as convenient channels for patent transfer, with the joint effort of both academic and political partners [31,37]. The platforms enable the integration of isolated patents and provide communication and negotiation services to enhance the patent transfer [30,31]. An active and efficient patent marketplace could help reduce transaction costs and facilitate technology transfer, alleviating the information asymmetry problem [38].
Existing initiatives/platforms that are able to create matchmaking between supply and demand for innovation are sometimes ineffective, mainly for the following reasons:
  • They are paid services rather than open-access ones, often open innovation platforms;
  • They report the patent document as is, without a “translation” usable by everyone that would facilitate matching;
  • Classifying a patent into a given technological area is challenging: users choose among the proposed categories, but they often do not know which one fits best, and the proposed choices are not necessarily the best ones.
One of the most prominent steps in managing patents is the classification, a particularly expensive and time-consuming phase, above all due to the increase in the number of filed patents and the complexity of the contents; this task is in fact conventionally performed by domain experts [39,40]. Moreover, patent classification is almost always a multi-label classification task, because an invention may be related to different technological areas or be cross-domain, which makes the problem even more complicated. Therefore, finding ways to automate this costly and labor-intensive task is essential to assist domain experts in managing patent documents, facilitating reliable searching, better matching of innovation demand and supply, especially by companies [39,41,42,43].
The solution proposed in this paper is an artificial intelligence-based system designed to accurately recognize technological areas, thereby classifying patents with high precision and recommending the appropriate direction to users, thus eliminating the subjectivity of choice. In this paper, we use a combined approach of Natural Language Processing (NLP) and machine learning (ML) in order to support the patent platforms by addressing two main aspects: on the one hand, it is of paramount importance to create an automated recommender system that can identify the most suitable and correct technological area(s) a patent must be assigned; this is both useful for users looking for specific technologies or patent owners who want to reach the largest fraction of potentially interested users. On the other hand, the methodology allows for the identification of keywords characterizing a patent in an objective and human-independent way; this aspect is particularly useful to create an initial vocabulary of words extracted from patents, thus eventually leading to redefine the available categories and supporting the portal management and, again, the matchmaking among users and patent-owners.
Through the combined NLP and machine learning framework, this paper proposes an explainable system for multi-label patent classification, improving the user-friendliness of a platform and enhancing the selection of suitable patents by a company or, more generally, boosting the innovation matchmaking that leads to social and economic impact. The proposed approach has been implemented and tested on an Italian patent transfer platform named KnowledgeShare (KS), the largest Italian matchmaking portal for researchers, companies, investors, and innovators. KS is a joint initiative of Politecnico di Torino, the Italian Patent and Trademark Office at the Ministry of Enterprise and Made in Italy, and Netval (the Italian Network for the Valorization of Public Research), which aims to be a meeting point for companies and Italian public and private research centers. KS has two main goals: first, to generate a real social and economic impact at the national level, transforming research results into useful and tangible innovation and providing tangible support to businesses, not only Italian ones, in accelerating their innovation processes; second, to drive an economic return for universities and public research organizations, to be reinvested in new technology transfer activities within the public research system. KS is the only publicly available patent repository on Italian technology, allowing for a thorough investigation of the Italian patent scenario, which has not yet been carried out. In KS, the labels assigned to patents do not follow the International Patent Classification (IPC) and Cooperative Patent Classification (CPC) hierarchies but represent wide technological domains established by KS experts. Thus, an important question about the significance of this categorization naturally arises.
Finally, the KS database is particularly appealing for classification purposes: on the one hand, it reduces the number of available categories and mitigates the curse of dimensionality; on the other hand, it does not require any arbitrary cut-off in the classification hierarchy (e.g., class, sub-class, group, sub-group), making it possible to work with clear and well-established categories.
The rest of the paper is structured as follows: in Section 2 we highlight the main features of the KS database; Section 3 outlines the NLP and Machine Learning methods we use in this work; Section 4 shows the results of our pipeline; in Section 5 we discuss the results also making comparisons with those found in the scientific literature highlighting strengths and weaknesses of our analysis; and in Section 6 we draw conclusions and future developments of this study.

2. Materials

The KS (Knowledge-share.eu) database we use in this work includes patents registered from 28 December 1999 to 16 August 2021, consisting of 1731 patents. These documents are uploaded to the platform by 89 Italian Research Centers, both public and private. The yearly number of patents is highly variable, depending on the filing date and the consequent policy about the protection of intellectual property at the national and international levels. This platform can be easily queried by users aiming at obtaining an overview of the state-of-the-art about particular technologies and ground-breaking startups in Italy. On one hand, this service lowers firms’ and investors’ entry barriers for innovations in fundamental and applied science, letting them overcome R&D&I challenges more easily [44]. On the other hand, this platform helps scientists and startups in achieving visibility, expressing their innovative potential, and gaining interest from private and public investors.
KS carves a distinct niche in the international technology transfer landscape by blending a curated, high-quality pipeline with a proactive, service-oriented approach. Unlike many competitors that function as open, passive marketplaces, KS focuses exclusively on innovations originating from universities and research centers, ensuring a baseline of scientific rigor. The platform’s core strength lies in its active management of content. Every technology is presented using a proprietary, standardized template, which guarantees clarity, completeness, and ease of comparison for industrial partners—a feature largely lacking in competing platforms. Furthermore, KS differentiates itself by moving beyond purely algorithmic matching. It employs a team that actively curates content and provides targeted, human-validated recommendations to companies and investors. This proactive scouting, combined with a comprehensive suite of services including valorization support and training, transforms KS from a simple database into an end-to-end technology transfer facilitator. This unique model, which ensures high-quality, standardized information and adds a layer of human intelligence to the matching process, positions Knowledge-share.eu as a potentially more effective and reliable bridge between academic research and industrial application. Moreover, the collaboration with the Ministry allows Netval to provide KS free of charge for all its stakeholders, thus lowering a main barrier to the adoption and thus further supporting a widespread use and accessibility for SMEs and public research centers.
The KS platform allows the user to search for keywords, home institutions, or technological areas. By focusing on a particular patent, the marketing form appears, i.e., a sheet that shows some salient information about the patent, such as patent owner, inventors, patent status, priority number, priority date, licence, commercial rights, and availability, and the content of the patent through some keywords, an explanatory image, an introduction, the technical features, and the possible applications; Figure 1 shows the platform homepage.
Currently, there are more than 1520 registered users on the KS platform, including companies, investors, banks, and other stakeholders, which makes KS the largest patent platform in Italy and the most accessed one for patent investigation and technological transfer. One or more labels can be assigned to each patent; these labels represent the following ten technological domains, from now onwards referred to as categories:
  • Aerospace and aviation;
  • Agrifood;
  • Architecture and design;
  • Chemistry, Physics, New materials and Workflows (Basic Science);
  • Energy and Renewables (Green Energy);
  • Environment and Constructions (Environment);
  • Health and Biomedical (Biomed);
  • Informatics, Electronics and Communication Systems (Electronics);
  • Manufacturing and Packaging (Packaging);
  • Transport.
KS data is not equally distributed among labels, a well-known phenomenon in patent analysis studies. In particular, the most populated area is “Health and Biomedical”, which accounts for over 30% of the entire database, while the least populated one is “Aerospace and aviation”, with less than 3%; see Figure 2.
Some technological domains were excluded from the analysis because their number of patents was negligible with respect to other areas: “Aerospace and aviation”, “Architecture and design”, and “Transport”. Moreover, patents associated with more than two labels, accounting for less than 5% of the available data, were discarded. The final data consisted of 1527 patents and 7 possible labels. As a final remark, KS patents (written in Italian) were translated into English using the deep_translator package in Python (v 3.12.7), as NLP tools for English texts are more developed and consolidated than those for other languages [45].
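The filtering step just described can be illustrated with a minimal sketch. The record layout (a dict with a "labels" list), the function name, and the toy patents are hypothetical. The handling of a patent that carries both a retained and an excluded label is also an assumption, since the text does not specify it: here the excluded label is stripped and the patent is kept.

```python
# Categories excluded from the analysis, as listed in the text.
EXCLUDED = {"Aerospace and aviation", "Architecture and design", "Transport"}
MAX_LABELS = 2  # patents with more labels than this are discarded


def filter_patents(patents):
    """Return the patents that survive the label-based filtering.

    Assumption: the label count is checked on the original labels, and a
    patent is dropped only if no retained label is left after stripping
    the excluded categories.
    """
    kept = []
    for patent in patents:
        labels = patent["labels"]
        if len(labels) > MAX_LABELS:
            continue
        remaining = [lab for lab in labels if lab not in EXCLUDED]
        if not remaining:
            continue
        kept.append({**patent, "labels": remaining})
    return kept


# Made-up records, for illustration only.
toy_patents = [
    {"id": 1, "labels": ["Biomed"]},
    {"id": 2, "labels": ["Transport"]},
    {"id": 3, "labels": ["Biomed", "Electronics", "Packaging"]},
    {"id": 4, "labels": ["Agrifood", "Transport"]},
]
filtered = filter_patents(toy_patents)
print([(p["id"], p["labels"]) for p in filtered])
```

In this toy run, patents 2 and 3 are discarded, while patent 4 is kept with the single label "Agrifood".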

3. Methods

3.1. The Overall Flowchart

As stated in the Introduction, this work has two main goals: (1) classifying the patents in the KS database with Artificial Intelligence models; (2) identifying the most important keywords characterizing the labels (i.e., the corresponding technological areas). Accordingly, we design and implement an approach based on three steps: (i) textual analysis, (ii) multi-class classification, and (iii) identification of the important features underlying patent categorization. A pictorial overview is presented in Figure 3.
In the first phase, we use Natural Language Processing (NLP) tools to transform patents’ textual data into matrix form, which is a necessary step in order to use them as input in an Artificial Intelligence pipeline. In the multiclass classification phase, we use Machine Learning algorithms to assign the corresponding labels (i.e., the classes) to patents. In particular, to carry out these first two phases and assess the performance of the classification algorithms, as a zeroth-order approach, we discard patents’ temporal information and consider a 5-fold cross-validation framework repeated 100 times [46]. A pictorial overview of this framework is depicted in Figure 4.
This approach was adopted to highlight the Italian innovation scene, as it is represented by the KS database, more than assessing the temporal evolution. This approach is justified by an ex-post validation: bad classification performances of all the algorithms would indicate the presence of a significant number of disruptive patented innovations during the considered time period, so that training algorithms on a set of patents (included in the training folds) would not be helpful in correctly classifying the remaining patents (comprised in the test fold). Accordingly, we measure the performance of these algorithms using ranking-based evaluation indexes and find the most effective one in patent classification in the KS database [47].
Finally, in the third phase, we accomplish a feature importance task on the top-performer algorithm: we aim at identifying those words that mainly influence the classification of patents in the corresponding classes. In the following sections, these phases are explained in detail.
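The repeated 5-fold cross-validation framework described in this section can be sketched with the standard library alone. The function name and the plain (non-stratified) shuffling are assumptions made for illustration; the text does not state how the folds are drawn.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repetition reshuffles the sample indices and splits them into
    n_splits disjoint folds; every fold serves once as the test fold.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[f::n_splits] for f in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            train_idx = [
                i
                for other, part in enumerate(folds)
                if other != fold
                for i in part
            ]
            yield repeat, fold, train_idx, test_idx


# Small toy configuration: 20 samples, 5 folds, 3 repetitions.
splits = list(repeated_kfold(n_samples=20, n_splits=5, n_repeats=3))
print(len(splits))  # 5 folds x 3 repetitions = 15 train/test splits
```

With the settings reported in the text (5 folds, 100 repetitions), the same generator would produce 500 train/test splits, each of which is pre-processed, trained, and evaluated independently.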

3.2. From Patents to Feature Representation

In every cross-validation iteration, the set of patents contained in the training folds is transformed, using the appropriate NLP tools, into a representation suitable for machine learning algorithms. First, patents’ words (usually called tokens) are lower-cased. Learning algorithms are sensitive to upper/lower-case differences; thus, lower-casing every single word is necessary [48,49]. This first pre-processing phase has a threefold result: making the data uniform, obtaining a substantial dimensionality reduction, and yielding a correct and meaningful tokenization. Then, we remove stop words, as they are useful in setting up phrases but meaningless for semantic purposes [50,51]. Finally, we perform lemmatization, a procedure that reduces inflected words to their root form. For example, words like “fishing” or “fished” are reduced to “fish”. Here too, the goal is to ease the following learning phase; in fact, grouping words sharing a common root reduces the sparsity of the data, leading to a more compact representation, less prone to the curse of dimensionality [52,53]. We thus end up with a set of words representing the vocabulary of the patents contained in the training folds.
All these pre-processing phases are preparatory to building the TF-IDF matrix (where TF = Term Frequency and IDF = Inverse Document Frequency) [54,55]. We chose TF-IDF for the sake of simplicity and interpretability. For example, while TF-IDF works directly on the corpus without any training phase, other techniques, such as Word2Vec, require training on a large corpus. Each TF-IDF matrix element is the product of two factors: TF(i,j), representing the frequency of the word i in patent j, and IDF(i,j), a weighting factor that takes into account the frequency of word i across all the patents. In particular, these terms are defined as $\mathrm{TF}(i,j) = n_{ij}/|d_j|$ and $\mathrm{IDF}(i,j) = \log(D/|d(i)|)$, with $n_{ij}$ being the number of occurrences of the word i in patent $d_j$, D the number of patents in the database, and d(i) the set of patents containing the word i at least once; |.| denotes the set cardinality, i.e., the number of elements within a set. The rationale of this product is the following: the first factor rewards a high frequency of a word within a patent (the more often a word occurs, the greater its importance); the second factor penalizes a high frequency across the whole set of patents, because a word used in all patents would yield poor discrimination, and emphasizes the role played by rarely occurring terms. The result is a matrix whose number of columns equals the number of patents in the training folds and whose number of rows equals the length of the respective vocabulary. If a patent has two assigned categories, we duplicate the corresponding column, associating one category with the original column and the other with its copy. For the patents in the test fold, a corresponding TF-IDF matrix is built using the vocabulary of the training folds. This choice avoids data leakage between the training and test phases and, consequently, over-optimistic estimates of the generalization ability of the Machine Learning algorithms.
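A minimal sketch of these three pre-processing steps (lower-casing, stop-word removal, lemmatization) is given below. The stop-word set and the lemma dictionary are tiny, hand-made stand-ins used only for illustration; they are assumptions, not the resources used in this work, and a real pipeline would rely on the stop-word list and lemmatizer of an NLP library.

```python
import re

# Toy resources (hypothetical): a handful of stop words and lemma mappings.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "is", "in", "to", "was"}
LEMMAS = {
    "fishing": "fish",
    "fished": "fish",
    "sensors": "sensor",
    "devices": "device",
}


def preprocess(text):
    """Lower-case, tokenize, drop stop words, and map tokens to lemmas."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


tokens = preprocess("The Sensors of the device fished for a signal")
print(tokens)  # ['sensor', 'device', 'fish', 'signal']
```

The vocabulary of the training folds is then simply the set of all tokens returned by this function over the training patents.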
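The TF-IDF definitions above can be turned into a short sketch. The function names and the toy documents are hypothetical. One point is an assumption: the text only states that the test matrix uses the training vocabulary, and the sketch additionally reuses the IDF weights estimated on the training folds, which is a common way to keep test information out of the representation.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Return {word: log(D / |d(i)|)} computed on the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(word for doc in train_docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Sparse TF-IDF weights of one tokenized document.

    TF is n_ij / |d_j|; words outside the training vocabulary are ignored.
    """
    counts = Counter(doc)
    return {
        word: (counts[word] / len(doc)) * weight
        for word, weight in idf.items()
        if word in counts
    }


# Made-up tokenized patents standing in for the training folds.
train = [["fish", "sensor", "fish"], ["sensor", "signal"], ["sensor", "cell"]]
idf = fit_idf(train)

# A test document containing one word ("polymer") unseen during training.
vec = tfidf_vector(["fish", "polymer", "sensor", "fish"], idf)
print(vec)
```

In this toy case "sensor" appears in every training document, so its IDF (and hence its TF-IDF weight) is zero, while "polymer" is dropped because it is not part of the training vocabulary.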

3.3. Patent Categorization as a Multi-Class Classification Problem

The TF−IDF representation can be suitably exploited by any machine learning algorithm, with patents representing the available samples, word occurrences being the discriminating features, and the target variable the labels associated with each patent. Since, according to Section 2, at most two out of seven labels are associated with every patent, we are faced with a non-standard multi-class classification problem. In fact, in a standard multi-class classification problem, only one label (out of more than two labels) is associated with every sample. Nonetheless, since multiclass classification algorithms return a ranking of all the possible labels for every sample, we can safely use classical Machine Learning methods, but we need performance measures different from those usually used in standard multiclass classification. Accordingly, we have to refer to measures usually adopted in Information Retrieval tasks to evaluate Recommendation Systems [56,57]. In particular, we first consider two common metrics in these fields, that are defined for each sample:
Precision-top-k (P@k): represents the rate of correct labels for the i-th sample in the top-k labels as predicted by the classification model. In formula:
$$P@k_i = \frac{\text{number of correct labels in top } k \text{ for the } i\text{-th sample}}{k}$$
Recall-top-k (R@k): the ratio between the number of correct labels for the i-th sample in the top-k labels as predicted by the classification model and the number of true labels for that sample. In mathematical terms:
$$R@k_i = \frac{\text{number of correct labels in top } k \text{ for the } i\text{-th sample}}{\text{number of true labels for the } i\text{-th sample}}$$
Both these local metrics have values ranging from 0 to 1.
It is possible to define two corresponding global metrics from the previous local ones by averaging the local values we obtained for each sample:
$$\overline{P@k} = \frac{1}{N} \sum_{i=1}^{N} P@k_i$$
$$\overline{R@k} = \frac{1}{N} \sum_{i=1}^{N} R@k_i$$
where N is the number of samples under consideration.
It is worth noting that the information carried by the average Precision-at-k and the average Recall-at-k can be condensed into a single metric: the averaged F1-top-k ($\overline{F1@k}$). This metric is defined as the harmonic mean of $\overline{P@k}$ and $\overline{R@k}$:
$$\overline{F1@k} = \frac{2 \times \overline{P@k} \times \overline{R@k}}{\overline{P@k} + \overline{R@k}}$$
$\overline{F1@k}$ ranges from 0 to 1; it is equal to 0 when at least one of the two metrics is 0 and equal to 1 when both of them are equal to 1. Since a very high precision can be obtained at the cost of a very low recall, and vice versa, we use F1@k as a balanced performance measure [58,59]. In particular, following the scientific literature on automatic patent classification [60], we consider F1@k with k = 1 as our main performance metric. Accordingly, we can use this score to benchmark our work against the best F1 scores reported in other articles.
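The three ranking-based metrics defined above can be computed as in the following sketch. The function names and the toy rankings are hypothetical; each ranking is assumed to list the labels from most to least likely, and each truth set contains the one or two labels assigned to the sample.

```python
def precision_at_k(ranked, true, k):
    """Fraction of the top-k predicted labels that are correct."""
    return sum(1 for lab in ranked[:k] if lab in true) / k


def recall_at_k(ranked, true, k):
    """Fraction of the true labels found among the top-k predictions."""
    return sum(1 for lab in ranked[:k] if lab in true) / len(true)


def averaged_scores(rankings, truths, k):
    """Return the averaged P@k, R@k, and their harmonic mean F1@k."""
    n = len(rankings)
    pairs = list(zip(rankings, truths))
    p = sum(precision_at_k(rank, truth, k) for rank, truth in pairs) / n
    r = sum(recall_at_k(rank, truth, k) for rank, truth in pairs) / n
    f1 = 0.0 if p + r == 0 else 2 * p * r / (p + r)
    return p, r, f1


# Made-up label rankings for three samples, with their true label sets.
rankings = [
    ["Biomed", "Electronics", "Packaging"],
    ["Packaging", "Electronics", "Basic Science"],
    ["Agrifood", "Environment", "Biomed"],
]
truths = [{"Biomed"}, {"Electronics", "Packaging"}, {"Environment"}]

p1, r1, f1_1 = averaged_scores(rankings, truths, k=1)
p2, r2, f1_2 = averaged_scores(rankings, truths, k=2)
print(p1, r1, f1_1)
print(p2, r2, f1_2)
```

For k = 1 the toy data give an averaged precision of 2/3, an averaged recall of 1/2, and an F1 of 4/7; for k = 2 the recall reaches 1 and the F1 rises to 0.8, which illustrates why a sample with two true labels can never reach full recall at k = 1.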
To evaluate the robustness of our pipeline, we consider and evaluate five different models: a Random Classifier (RC), serving as a performance baseline, Logistic Regression (LR) [61], Random Forest (RF) [62], Support Vector Machine with linear kernel (SVM) [63] and XGBoost (XGB) [64]. We consider these Machine Learning models since they are all able to return a measure of features’ importance, although through different mechanisms [65,66]. In fact, LR assigns a coefficient to every input feature, while tree-based models define the importance of a feature according to its ability to increase the purity of a node in the tree, as measured by the Gini index. Moreover, SVM with linear kernel, being a geometrical method that aims at finding a hyperplane in the feature space that separates classes, gives a weight to every dimension (i.e., to every feature). For all the considered algorithms, the standard configuration was used, without tuning their hyperparameters. The rationale was to keep the models as simple as possible to achieve a fair evaluation of the informative content provided by the patents, not concealed by the algorithms’ learning phase.
It is worth noting that, even though all the models can be used for multi-class classification, not all of them natively support it [67,68]. In fact, LR and SVM are designed for binary classification, not multi-class classification. Nonetheless, this limitation can be circumvented by adopting a “One-versus-Rest” (O-v-R) or a “One-versus-One” (O-v-O) framework. In the first case, for each class, a binary model is trained to recognize that class against all the others; accordingly, we end up with one model per class. In the second case, a model is trained to distinguish every pair of classes, yielding a much greater number of models than in the O-v-R case. In this work, we adopt the O-v-R approach because it reveals which features (i.e., words) characterize every category with respect to all the others.
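The one-model-per-class scheme adopted here can be sketched as follows. This is an illustrative, self-contained implementation under stated assumptions: the logistic regression is fitted with plain batch gradient descent (the learning rate and number of epochs are arbitrary), and the three-feature toy data are made up. It is not the configuration used in this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_binary_lr(X, y, epochs=500, lr=0.5):
    """Batch gradient-descent logistic regression; returns (weights, bias)."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def fit_one_vs_rest(X, labels, classes):
    """Train one binary model per class: that class (1) against the rest (0)."""
    return {
        c: fit_binary_lr(X, [1.0 if lab == c else 0.0 for lab in labels])
        for c in classes
    }


def rank_labels(models, x):
    """Rank all classes for sample x by decreasing predicted probability."""
    scores = {
        c: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        for c, (w, b) in models.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy feature columns: weights of three hypothetical words.
X = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0], [0, 0, 1], [0, 0.1, 0.9]]
labels = ["Agrifood", "Agrifood", "Electronics", "Electronics", "Biomed", "Biomed"]
models = fit_one_vs_rest(X, labels, ["Agrifood", "Electronics", "Biomed"])
ranking = rank_labels(models, [0, 1, 0])
print(ranking)
```

Two properties of this scheme matter for the rest of the analysis: the per-class scores yield a full label ranking for every sample, and each class has its own coefficient vector, so the word with the largest coefficient in a given model is the one that most pushes a patent towards that category.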

3.4. Mitigating Class Imbalance: SMOTE

In the previous sections, we described how to transform textual data into a matrix form and how to set up a multi-class classification problem, together with presenting the appropriate performance metrics. However, it should be noted that, as stated in Section 2, there is an imbalance among classes that can severely impact algorithms’ performance [69,70]. In order to mitigate this effect, we leverage the mathematical form of textual data and use SMOTE (Synthetic Minority Oversampling TEchnique) in order to obtain a perfect balance among classes [71,72]. In particular, we apply this technique in the training folds of every iteration in our cross-validation framework. Then, for every iteration, we train our models on these class-balanced training folds and validate them on the test fold. We then replicate this workflow for every repetition of the cross-validation framework.
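The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. The function name, the number of neighbours, and the two-dimensional toy points are assumptions chosen for illustration; the sketch balances a single minority class against a single majority class, whereas the multi-class case would repeat the procedure for every under-represented class.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from the minority samples.

    Each synthetic point lies on the segment joining a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority samples, sorted by squared distance from the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Made-up feature vectors: five majority and three minority samples.
majority = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [0.1, 0.3], [0.3, 0.1]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]

balanced_minority = minority + smote(minority, len(majority) - len(minority))
print(len(majority), len(balanced_minority))  # both classes now have 5 samples
```

As in the text, such oversampling would be applied to the training folds only, so that the test fold keeps its original class distribution.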

4. Results

In this section, we show the results obtained by applying our pipeline to the KnowledgeShare database. In particular, we will show $\overline{P@k}$, $\overline{R@k}$ and $\overline{F1@k}$ for k = 1, for completeness, but we will use $\overline{F1@1}$ to determine the best model. Once the best classification model is identified, we deepen its performance analysis, highlighting the most important words: those that mainly influence it in classifying patents. Moreover, we show how reliable it is in classifying patents in each of the seven categories of the KnowledgeShare database, and which are the most error-prone.

4.1. Top-k Performance Measures

In Table 1, we report, for each considered classification algorithm, the mean and standard deviation values (in brackets) of $\overline{P@1}$, $\overline{R@1}$ and $\overline{F1@1}$ as determined by our 5-fold cross-validation approach repeated 100 times.
Observing the $\overline{F1@1}$ values in Table 1, it can be stated that the Random Classifier is the worst model, while all the other models clearly outperform it. This means that these models effectively use words to classify patents. Moreover, it can be observed that LR and SVM have similar performances. Accordingly, we perform a Welch’s t-test to establish whether the mean values of $\overline{F1@1}$ of LR and SVM differ in a statistically significant way. At a 5% significance level, we can reject the null hypothesis that the mean values of $\overline{F1@1}$ of the LR and SVM models are identical; at a 1% significance level, however, we cannot reject this null hypothesis. In order to remove this ambiguity, following the scientific literature, we consider the performances of LR and SVM as measured by $\overline{F1@2}$. We obtain the results reported in Table 2, where, for the sake of completeness, we report the results for all the models.
Performing a Welch’s t-test at both 1% and 5% significance levels, we can reject the null hypothesis and consider the mean performances of LR and SVM different in a statistically significant way. Accordingly, we can consider LR our top-performing model in classifying patents.
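For reference, the Welch statistic used in these comparisons can be computed as in the sketch below. The two score lists are made-up numbers, not the values behind Tables 1 and 2, and the function name is hypothetical. The sketch returns only the t statistic and the Welch-Satterthwaite degrees of freedom; turning them into a p-value requires the Student t distribution, for instance through a statistics library such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the first mean
    vb = variance(b) / len(b)  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


# Hypothetical F1 scores of two models over five cross-validation runs.
scores_a = [0.70, 0.72, 0.71, 0.69, 0.73]
scores_b = [0.68, 0.70, 0.69, 0.71, 0.67]

t_stat, dof = welch_t(scores_a, scores_b)
print(round(t_stat, 6), round(dof, 6))  # 2.0 8.0
```

Unlike Student's t-test, this statistic does not assume equal variances in the two groups, which is why the degrees of freedom are estimated from the data rather than fixed at the pooled value.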

4.2. How Words Influence Patent Classification

According to the previous section, LR has proved to be the best model in classifying patents in the KS database. We can go deeper in analyzing LR performances, considering how the words in the vocabulary (i.e., the features of the model) influence the classification task carried out by the LR model. To this aim, since in the O-v-R scheme an LR model is built for every category, we evaluate the corresponding coefficients, which can be seen as a measure of the importance of the input features. Moreover, we also determine the words’ global importance ranking by concatenating the words’ coefficients of every category. We end up with a global ranking of 88,144 words. For the sake of clarity, Figure 5 shows the average importance for the first 1000 words in the global ranking. Applying the elbow method, we were able to quantitatively determine the number of most significant words, obtaining a set of 260. A more complete overview of these most important words, together with the category they belong to, can be found in Appendix A.
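The construction of the global ranking and the elbow-based cut can be sketched as follows. Several choices here are assumptions, since the text does not detail them: the ranking sorts the concatenated (category, word, coefficient) triples by decreasing coefficient, the elbow is taken as the point farthest from the straight line joining the first and last values of the sorted curve, and the words ranked before that point are retained. The toy coefficients are made up.

```python
def global_ranking(coefs_by_category):
    """Concatenate (category, word, coefficient) triples, sorted by coefficient."""
    triples = [
        (cat, word, coef)
        for cat, coefs in coefs_by_category.items()
        for word, coef in coefs.items()
    ]
    return sorted(triples, key=lambda t: t[2], reverse=True)


def elbow_index(values):
    """Index of the point farthest from the chord joining first and last values."""
    n = len(values)
    x0, y0, x1, y1 = 0, values[0], n - 1, values[-1]

    def distance(i):
        # Unnormalized point-to-line distance; the constant denominator
        # does not change which index attains the maximum.
        return abs((y1 - y0) * i - (x1 - x0) * values[i] + x1 * y0 - y1 * x0)

    return max(range(n), key=distance)


# Hypothetical one-versus-rest coefficients for two categories.
coefs = {
    "Biomed": {"cell": 5.0, "device": 0.8, "circuit": 0.4},
    "Electronics": {"circuit": 4.0, "device": 1.0, "cell": 0.6},
}
ranking = global_ranking(coefs)
cut = elbow_index([coef for _, _, coef in ranking])
top_words = ranking[:cut]
print(cut, [(cat, word) for cat, word, _ in top_words])
```

Because each word appears once per category, the concatenated ranking has as many entries as vocabulary size times number of categories, which is consistent with a ranking length that is a multiple of seven.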
Figure 6 shows the partition of the 260 words in the 7 technological areas.
As regards Green Energy, the top words include topics related to major renewable sources, heat, electricity, fuel, and power. In the Agrifood area, the top words include specific products like wine, vegetable oil, and milk. Interestingly, the waste theme also appears among the top words. In Electronics, the principal words are: user, network, circuit, software, signal, optical, and device. In the Environment and Constructions area, the focus seems to be on buildings with seismic properties and energy- and water-saving solutions. The only 5 words in the Packaging area appear not so relevant: component, object, material, packaging, and piece. On the contrary, the 5 words describing the Biomed sector are very meaningful because they seem to characterize the areas of greatest investment and innovation in recent years in the medical field: patient-centeredness, tissue engineering, treatment and diagnosis of cancerous and non-cancerous diseases, and stem cells and cell therapy. The last area, Basic Science, is described by only 2 words among the top 50: particle and solvent.
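The ranking procedure described in this section can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the coefficients are invented, and the elbow criterion used here (the point farthest from the chord joining the first and last points of the sorted curve) is one common choice that we assume for the example, since the exact variant of the elbow method is not specified in the text.

```python
import math


def global_ranking(coefs_by_category):
    """Concatenate per-category coefficients and sort them in descending order."""
    entries = [
        (word, category, coef)
        for category, coefs in coefs_by_category.items()
        for word, coef in coefs.items()
    ]
    return sorted(entries, key=lambda e: e[2], reverse=True)


def elbow_index(values):
    """Index of the point farthest from the chord joining the first and last points."""
    n = len(values)
    x0, y0, x1, y1 = 0, values[0], n - 1, values[-1]
    norm = math.hypot(x1 - x0, y1 - y0)
    dists = [
        abs((y1 - y0) * i - (x1 - x0) * v + x1 * y0 - y1 * x0) / norm
        for i, v in enumerate(values)
    ]
    return max(range(n), key=dists.__getitem__)


# Hypothetical one-vs-rest LR coefficients (assumption, for illustration only).
coefs = {
    "Agrifood": {"food": 3.0, "wine": 1.2, "good": 0.2},
    "Green Energy": {"energy": 2.0, "solar": 0.5, "connect": 0.15},
    "Biomed": {"patient": 0.3, "cell": 0.25, "specific": 0.1},
}

ranking = global_ranking(coefs)
k = elbow_index([coef for _, _, coef in ranking])
top_words = [(word, category) for word, category, _ in ranking[: k + 1]]
print(top_words)
```

Because each entry is a (word, category) pair, the same word can appear more than once in the global ranking with different categories, as happens for several words in Appendix A.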

4.3. Confounding Categories for the Best Classifier

In this section, we explore which categories are the most confounding categories for the best-performing model, namely the LR. To this aim, in Table 3 we show, for every category, the mean frequencies with which it is confused with all the others in our cross-validation framework. Moreover, in Table 3 we also show the frequencies of correct classification of each category.
Observing these results, “Biomed” is the best-recognized category, meaning that it is a well-established and self-contained research area, with its own keywords and concepts, and is rarely confused with other research domains. On the other hand, “Packaging” is the worst-recognized category, being mainly confused with “Electronics” and “Basic Science”. This indicates that “Packaging” cannot be considered a standalone research area; on the contrary, it may be seen as a cross-domain that benefits from innovations in Basic Science and Electronics. For example, Packaging patents contain words such as “sensors”, “device”, and “system”, which can easily be found in other domains. It should also be kept in mind that the Packaging category includes items belonging to the broader field of manufacturing; two patents are reported as examples in Appendix B. In fact, an invention may cover several technological areas or be cross-domain, and this is increasingly true of emerging technologies, which are becoming more and more “multi-disciplinary” and struggle to remain “confined” to a single technological area. This highlights the need to create new technological domains that are not strictly tied to a single area.
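A minimal sketch of how confusion frequencies of this kind can be computed is given below. It assumes, for simplicity, one reference category and one top-ranked predicted category per patent, which may differ from the way Table 3 was actually built in the multi-label, cross-validated setting; the labels are invented for illustration.

```python
from collections import Counter, defaultdict


def confusion_frequencies(true_labels, predicted_labels):
    """Row-normalised confusion table.

    freq[t][p] is the share of patents with reference category t
    whose predicted category is p (p == t means correct classification).
    """
    counts = defaultdict(Counter)
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return {
        t: {p: c / sum(row.values()) for p, c in row.items()}
        for t, row in counts.items()
    }


# Hypothetical labels (assumption, for illustration only).
true = ["Packaging"] * 4 + ["Biomed"] * 4
pred = [
    "Packaging", "Electronics", "Electronics", "Basic Science",
    "Biomed", "Biomed", "Biomed", "Basic Science",
]

freq = confusion_frequencies(true, pred)
for category, row in freq.items():
    print(category, row)
```

In a cross-validation framework, such a table would be computed on each test fold and then averaged across folds to obtain mean frequencies.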

5. Discussion

5.1. Best Method Performance

Patent Automatic Classification (PAC) algorithms have attracted increasing interest in recent years because of the ever-growing number of patents and the consequent need for high-quality analysis [56]. Moreover, correctly classifying patents means understanding why a patent is labelled with some categories and not with others. We can therefore query the classifier to learn which words mainly influence the classification of a patent, using the model’s feature importance. Since this is our work’s main aim, we have to compare our best classifier’s performance with the state of the art found in the literature. It should be noted that there is no officially acknowledged benchmark database on which pipelines’ performances can be compared: there is neither a common set of patents nor a shared patent classification system (e.g., CPC, IPC). The significance of the following comparisons should be weighed in this light. Moreover, it should be underlined that, as far as we know, our work is among the first to use a cross-validation approach, presenting mean and standard deviation values of $\overline{P@k}$, $\overline{R@k}$ and $\overline{F1@k}$.
Beginning from articles having a dataset size similar to ours, both Fall et al. [73] and Hepburn [74] consider SVM together with other classification algorithms (k-NN, Naive Bayes) to classify 75,250 patents in 8 categories (IPC section level). In particular, Hepburn [74] uses a 60/40 train-test split and a particular transfer learning technique, reporting $\overline{F1@1}$ = 0.784, without standard deviation, which improves on the result of Fall et al. [73], who report $\overline{F1@1}$ = 55%. Hu et al. [58] use Neural Networks (both Convolutional and Recursive) to capture semantic correlations among patents. The training, validation, and test datasets contain, respectively, 72,532, 18,133, and 2679 patents from the CLEF-IP competition dataset (http://ifs.tuwien.ac.at/~clef-ip/, accessed on 16 May 2023) with 96 labels. Their best $\overline{F1@1}$ is 63.97%. Regarding the CLEF-IP competition itself, Verberne et al. [75] obtain their best F1@1, equal to 70.59%, in a series of classification experiments with the Linguistic Classification System (LCS). The training dataset has 905,458 patents, and the testing dataset has 1000 patents. Li et al. [56] built a Deep Learning model to classify more than 2 million patents in the USPTO-2M dataset (https://www.uspto.gov/, accessed on 31 January 2024) in 637 classes (the subclass level in IPC), reporting P@1 = 73.9% without disclosing F1@1. Haghighian Roudsari et al. [39] compare different Deep Learning models on the USPTO-2M dataset, obtaining a maximum $\overline{F1@1}$ = 63.33%, with $\overline{P@1}$ = 82.72% and $\overline{R@1}$ = 55.89%. Lee and Hsiang [60] improve this result by considering a novel USPTO-3M dataset, comprising more than 3 million patents and working both at the IPC subclass level and at the CPC subclass level (656 classes). They report a maximum $\overline{F1@1}$ = 66.71% and, correspondingly, $\overline{R@1}$ = 54.92% and $\overline{P@1}$ = 84.95%.
Considering that our best model (LR) has $\overline{F1@1}$ = 73.5% (2.2%), $\overline{P@1}$ = 80.1% (2.2%), and $\overline{R@1}$ = 67.9% (2.2%), we can say that our results are in line with the state of the art found in the literature, thus justifying the investigation of LR words’ importance and the most confounding categories. Nonetheless, it should be pointed out that this analysis is limited by the sample size, and further evaluations are needed on a larger dataset; a possible strategy could be the collection of data from other countries, which would also allow the comparison of the Italian situation with that of other countries. Netval is already exploring this possibility, along with the integration of the presented remarks in the platform. Of course, the limited sample size prevented this study from exploring a wide and popular class of algorithms, such as those based on deep neural networks. However, the goal of this work was to capture the Italian patent landscape rather than to deliver a decision support system; this is why we preferred to adopt simpler but definitely more interpretable strategies.
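For readers comparing the figures quoted in this section, the following sketch shows one common way of computing precision, recall, and F1 at rank k for multi-label predictions. It is an assumption that these definitions match the ones adopted in each cited work: in particular, some authors compute F1@k as the harmonic mean of the averaged P@k and R@k, whereas this sketch averages the per-patent F1. The labels and rankings are invented for illustration.

```python
def metrics_at_k(true_labels, ranked_predictions, k):
    """Average precision@k, recall@k and F1@k over a collection of patents.

    true_labels: list of sets with the reference categories of each patent.
    ranked_predictions: list of category lists, most probable first.
    """
    p_sum = r_sum = f_sum = 0.0
    for truth, ranked in zip(true_labels, ranked_predictions):
        hits = len(set(ranked[:k]) & set(truth))
        p = hits / k               # fraction of the top-k predictions that are correct
        r = hits / len(truth)      # fraction of the reference labels that are retrieved
        f = 2 * p * r / (p + r) if hits else 0.0
        p_sum += p
        r_sum += r
        f_sum += f
    n = len(true_labels)
    return p_sum / n, r_sum / n, f_sum / n


# Hypothetical patents (assumption, for illustration only).
truth = [{"Biomed"}, {"Electronics", "Packaging"}, {"Green Energy", "Environment"}]
ranked = [
    ["Biomed", "Basic Science"],
    ["Electronics", "Basic Science"],
    ["Electronics", "Environment"],
]

p1, r1, f1 = metrics_at_k(truth, ranked, k=1)
p2, r2, f2 = metrics_at_k(truth, ranked, k=2)
print(f"k=1: P={p1:.3f} R={r1:.3f} F1={f1:.3f}")
print(f"k=2: P={p2:.3f} R={r2:.3f} F1={f2:.3f}")
```

Moving from k = 1 to k = 2 typically raises recall and lowers precision, which is why a second cut-off can help to separate two models whose scores at k = 1 are close.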

5.2. Categories’ Keywords and How They Explain the Confusion Frequencies

A detailed examination of the words in each category reveals some trends in the Italian innovation scenario and can even explain the confusion frequencies reported in Section 4.3.
As concerns the “Green Energy” category, the innovations speak about automatic systems for cleaning solar photovoltaic modules, water production from the air with solar energy, systems for concentrating the solar power, energy-efficient systems for the use of solar thermal energy, micro turbo gas systems, thermodynamic solar systems, and so on. Since these innovations deal with electronic devices and the exploitation of physical processes, it is not surprising that the main areas it is confused with are “Electronics” and “Basic Science”. Moreover, since green energy innovations aim at the protection of the environment, this strong relationship is reflected in the non-negligible confusion frequency of “Green Energy” with “Environment”.
With reference to the “Agrifood” area, the main identified words suggest research investments in innovating the quality and production systems of wine, lab-on-chip devices for the improvement of olive oil protection, and novel sustainable processes for milk production. The main innovations in this field involve biological processes and methodologies, and this is reflected in the relatively high frequency of confusion of this area with Biomed.
The “Electronics” research area reveals the presence of innovations about user-centered design methods, novel architectures of artificial neural networks, and, above all, deployments of synthetic biological circuits. This relatively new research branch is increasingly growing and explains the high frequency of confusion between “Electronics” and “Biomed”, which mainly encompasses biological tools and methodologies.
As regards the “Environment” category, the most frequent words concern innovations in: cleantech, automatic irrigation systems, water saving, energy production from water or wave, micro-geophysics, remediation of stone buildings, dehumidification processes, building maintenance services, and seismic isolation methods, and so on. It should be noted that these innovations are strictly related to the use of novel electronic devices and sensors and the pioneering exploitation of thermodynamic processes. These observations explain why “Environment” patents are mainly labelled as “Electronics” and “Basic Science”. Moreover, as noted in the previous discussion about the “Green Energy” category, there is a strong relationship between it and “Environment” innovations, which accounts for the relatively high level of confusion between them.
In the “Packaging” area, a number of health-related innovations characterize the research effort in this field: Hybrid locomotion stair-lift wheelchair, artificial muscle, human-robot cooperation, wearable robot for lower limb, exoskeleton for upper limb rehabilitation, adaptive wearable robot, support frame for upper limb exoskeleton, biomimetic active foot and ankle prosthesis. The remaining innovations include 3D printing, multi-sensor dimensional measurements, conversion of traditional vehicles, and language learning devices. It can be seen that innovations registered in the “Packaging” area are mainly linked to the development of novel electronic devices for relieving heavy lifting and learning methods using ad-hoc Artificial Intelligence algorithms. This can be seen as the main reason behind the high confusion frequencies characterising the “Packaging” area, already noted. In fact, it has the two highest confusion frequencies with “Electronics” and “Basic Science”. This clearly indicates that the “Packaging” category cannot be considered as a stand-alone research branch because it largely benefits from innovations both in electronics and in basic science research.
The innovations in the “Biomed” technological area range from wearable devices, methods for diagnosis and/or monitoring of infection, to new molecules for cancerous and chronic diseases, interventional radiotherapy, tests for early detection of tumors, theragnostic radiopharmaceuticals, cell therapies, and selection of cell populations. It should be noted that these novel techniques and devices mainly rely on Artificial Intelligence methodologies. In fact, most patents concern methods in tissue engineering, an interdisciplinary area concerned with the development of functional 3D tissues by mixing cells and bioactive chemicals. This explains why the “Biomed” category has non-negligible confusion frequencies with “Basic Science” and “Electronics”. Nonetheless, since Biomed research is also highly characterized by biological-medical terms, the corresponding confusion frequencies are not as high as in all the other categories. Another interesting aspect concerning Biomed is that, despite this being the largest community of patents, accounting for almost 30%, only 12% of the entries in the list of top words (Appendix A) are related to the Biomed field.
Research efforts in the “Basic Science” category are multi-faceted: they range from innovative, recyclable, and low-cost solvents to field-effect transistor sensors, 3D super-resolution optical microscopes, and magnetic resonance imaging using metamaterials. Since many patents deal with bio-materials or describe bio-medical innovations, it is not surprising that “Basic Science” has a 20.4% mean frequency of confusion with “Biomed”.
In general, as regards the confusion frequencies, it seems evident that the technological areas so defined in KS are sometimes a bit “narrow” to categorize patents that are related to different technology domains, i.e., the 7 technology areas may not be sufficient to capture the entire landscape of academic patents. It would be interesting to use an unsupervised clustering approach, based on the textual content of patents, to develop an alternative and probably more consistent classification, improving the matching of supply and demand for innovation [76]. Of course, this represents a time-independent picture; the dynamical analysis of the Italian landscape was outside the scope of the present work, but we definitely think this aspect would deserve a deep exploration.

6. Conclusions

In this work, we present a fully automated patent multi-label classification system based on NLP techniques and Machine Learning that highlights the most important words characterising the classification categories. Patent platforms such as the Italian one, named KnowledgeShare, were created to connect patent buyers with patent providers and help alleviate the problems of information asymmetry and distrust in traditional patent trading. To facilitate their mission and improve their efficacy, tools like the one proposed here are of fundamental importance. We demonstrated that the accuracy of the proposed framework compares well with the state of the art and, by relying on an easily interpretable model such as LR, it also allowed us to outline the keywords of each technological domain. An important feature of our methodology is that patents can be assigned one or more labels simultaneously, which reflects the possibility for emerging technologies to be cross-domain. The proposed method is open access and supports both patent managers, by providing a reliable recommendation about the categories that should be assigned to a patent and its technology, and the end-users of technological platforms (e.g., investors, companies), by providing a quantitative evaluation of the similarity between the technologies they are looking for, or want to patent, and the most suitable technological domains. Hence, the recommender system eases the matchmaking between innovation demand and supply, a fundamental aspect to generate social and economic impact with novel technologies. Although limited to the Italian patent ecosystem, the model is general and could be applied to patents from international research centres. As a future technical improvement, Artificial Neural Network architectures can be exploited, together with suitable explainability tools, in order to enhance both classification performance and, consequently, the insights into the innovation.
Another fundamental upgrade would consist of the implementation of the model on the KS platform.

Author Contributions

N.A. Conceptualization, methodology, supervision, original draft preparation, writing-review and editing; A.D.L. original draft preparation, formal analysis, investigation; E.P. methodology, writing-review and editing; G.C. resources, data curation, writing-review and editing; S.L. resources, data curation, writing-review and editing; S.T. methodology, writing-review and editing; A.M. supervision, project administration, writing-review and editing; R.B. supervision, project administration, writing-review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

KnowledgeShare (KS) platform is the largest digital repository of intellectual property in Italy, managed by Netval, in partnership with the Italian Patent and Trademark Office (UIBM) of the Ministry of Enterprise and Made in Italy, and the Polytechnic of Turin. Netval is an institutional—not for profit—association founded in 2007 that groups Public Research Centers, Universities, and Research Hospitals with the aim of supporting and expanding the research valorization activities of its member institutions. KnowledgeShare platform is an initiative funded thanks to the Italian National Recovery Plan and the Next Generation EU programs. Giuseppe Conti and Shiva Loccisano, authors of this paper, are involved in Netval, respectively, as the President of the Association and Knowledge Share Project Manager.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available on request due to restrictions (e.g., privacy and legal reasons).

Acknowledgments

A particular acknowledgment to Netval and the Ministry of Enterprise and Made in Italy for sharing the dataset and for their willingness to experiment with new approaches for improving the matching of innovation supply and demand in Italy.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this Appendix, we report the 260 most important words, as determined by Logistic Regression (LR), our top-performing classification algorithm. In Table A1, we list these words in descending order of their corresponding coefficients in the LR model, together with the category in which each word has that coefficient.
Table A1. The 260 most important words for the LR model, together with the corresponding category.
| Word | Category |
|---|---|
| food | Agrifood |
| energy | Green Energy |
| building | Environment |
| solar | Green Energy |
| heat | Green Energy |
| product | Agrifood |
| plant | Agrifood |
| patient | Biomed |
| cell | Biomed |
| battery | Green Energy |
| wine | Agrifood |
| water | Environment |
| electricity | Green Energy |
| construction | Environment |
| component | Packaging |
| oil | Agrifood |
| animal | Agrifood |
| user | Electronics |
| object | Packaging |
| material | Packaging |
| particle | Basic Science |
| extract | Agrifood |
| concrete | Environment |
| signal | Electronics |
| network | Electronics |
| information | Electronics |
| data | Electronics |
| photovoltaic | Green Energy |
| tissue | Biomed |
| power | Green Energy |
| thermal | Green Energy |
| fuel | Green Energy |
| packaging | Packaging |
| vegetable | Agrifood |
| gas | Green Energy |
| panel | Environment |
| milk | Agrifood |
| solvent | Basic Science |
| element | Environment |
| electrical | Green Energy |
| tumor | Biomed |
| seismic | Environment |
| communication | Electronics |
| circuit | Electronics |
| waste | Agrifood |
| piece | Packaging |
| optical | Electronics |
| machine | Packaging |
| olive | Agrifood |
| disease | Biomed |
| steel | Environment |
| farm | Agrifood |
| software | Electronics |
| device | Electronics |
| edible | Agrifood |
| liquid | Basic Science |
| chemical | Basic Science |
| blood | Biomed |
| network | Green Energy |
| fruit | Agrifood |
| tool | Packaging |
| structure | Environment |
| process | Basic Science |
| structural | Environment |
| material | Basic Science |
| flow | Green Energy |
| mechanical | Packaging |
| joint | Packaging |
| stiffness | Packaging |
| diagnosis | Biomed |
| electromagnetic | Electronics |
| paper | Packaging |
| bottle | Agrifood |
| metal | Basic Science |
| detector | Basic Science |
| surgical | Biomed |
| system | Green Energy |
| acid | Agrifood |
| image | Electronics |
| drug | Biomed |
| frame | Packaging |
| site | Environment |
| radar | Environment |
| human | Biomed |
| water | Green Energy |
| nanoparticles | Basic Science |
| polymer | Basic Science |
| package | Packaging |
| microorganism | Agrifood |
| environmental | Environment |
| air | Environment |
| wall | Environment |
| tag | Packaging |
| transmission | Electronics |
| pesticide | Agrifood |
| sensor | Environment |
| marine | Environment |
| content | Agrifood |
| generator | Green Energy |
| monitoring | Environment |
| sample | Basic Science |
| ion | Basic Science |
| exchanger | Green Energy |
| shaft | Packaging |
| reactor | Green Energy |
| stinger | Packaging |
| combustion | Green Energy |
| starch | Agrifood |
| high | Basic Science |
| polymeric | Packaging |
| lithium | Green Energy |
| robot | Packaging |
| system | Packaging |
| efficiency | Green Energy |
| mortar | Environment |
| clinical | Biomed |
| anchor | Environment |
| integrate | Packaging |
| laser | Packaging |
| force | Packaging |
| industrial | Packaging |
| invasive | Biomed |
| random | Electronics |
| remote | Electronics |
| treatment | Biomed |
| quantum | Electronics |
| rfid | Packaging |
| property | Basic Science |
| chain | Packaging |
| hand | Electronics |
| compound | Basic Science |
| diagnostic | Biomed |
| therapeutic | Biomed |
| part | Packaging |
| coli | Agrifood |
| limb | Packaging |
| hydrogen | Green Energy |
| defect | Packaging |
| microalgae | Agrifood |
| ph | Basic Science |
| code | Electronics |
| ceramic | Basic Science |
| grid | Green Energy |
| color | Agrifood |
| printing | Packaging |
| strain | Agrifood |
| maintenance | Environment |
| drone | Electronics |
| pathology | Biomed |
| screw | Packaging |
| bone | Biomed |
| antenna | Electronics |
| fiber | Packaging |
| mushroom | Agrifood |
| test | Environment |
| position | Electronics |
| gene | Biomed |
| composite | Basic Science |
| ultrasound | Biomed |
| cancer | Biomed |
| flight | Packaging |
| generation | Green Energy |
| wearable | Environment |
| sludge | Environment |
| manufacturing | Packaging |
| automotive | Packaging |
| module | Electronics |
| orchard | Agrifood |
| fat | Agrifood |
| therapy | Biomed |
| tunnel | Green Energy |
| cycle | Green Energy |
| mode | Electronics |
| surface | Basic Science |
| inspection | Environment |
| field | Basic Science |
| brain | Biomed |
| load | Environment |
| electronic | Electronics |
| tanning | Packaging |
| cell | Green Energy |
| beam | Basic Science |
| air | Green Energy |
| event | Electronics |
| produce | Green Energy |
| fiber | Basic Science |
| enzyme | Basic Science |
| membrane | Basic Science |
| system | Electronics |
| precursor | Basic Science |
| surgery | Biomed |
| polymer | Packaging |
| conversion | Green Energy |
| biomass | Green Energy |
| oxygen | Packaging |
| seed | Agrifood |
| syngas | Green Energy |
| gripping | Packaging |
| kinematic | Packaging |
| hot | Packaging |
| algorithm | Electronics |
| biomass | Agrifood |
| turbine | Green Energy |
| emission | Environment |
| material | Environment |
| virtual | Electronics |
| molecule | Biomed |
| production | Agrifood |
| cement | Environment |
| additive | Packaging |
| output | Electronics |
| good | Agrifood |
| risk | Biomed |
| environmental | Agrifood |
| cheese | Agrifood |
| natural | Agrifood |
| steam | Green Energy |
| biodegradable | Packaging |
| infection | Biomed |
| polyurethane | Packaging |
| graph | Electronics |
| bit | Electronics |
| constraint | Packaging |
| connect | Green Energy |
| heart | Biomed |
| recovery | Environment |
| area | Electronics |
| ground | Environment |
| critical | Packaging |
| receive | Electronics |
| model | Electronics |
| traditional | Environment |
| insert | Packaging |
| activity | Biomed |
| radio | Electronics |
| stereolithography | Packaging |
| mean | Environment |
| automatic | Environment |
| treatment | Basic Science |
| implement | Electronics |
| capable | Packaging |
| pack | Green Energy |
| hmd | Packaging |
| characteristic | Agrifood |
| foam | Basic Science |
| conductive | Packaging |
| reaction | Basic Science |
| electric | Packaging |
| specific | Biomed |
| voltage | Green Energy |
| reactor | Basic Science |
| vitro | Biomed |
| noise | Electronics |
| soil | Agrifood |
| arm | Packaging |
| circular | Packaging |
| plenoptic | Basic Science |
| reinforcement | Environment |
| radiation | Basic Science |
| roof | Environment |

Appendix B

This appendix shows two examples of patents:
1. ITALIAN: Sistema mobile per la sicurezza alimentare, apparecchio per il trattamento di una sostanza campione, in particolare un simulante di una sostanza alimentare, per la realizzazione di un contatto diretto tra tale sostanza campione e una superficie di contatto in condizioni controllate. Tutti i materiali e gli oggetti destinati a venire a contatto con gli alimenti sono designati con l’acronimo moca, mentre il settore di riferimento è designato con il termine inglese food packaging. Il problema tecnico posto e risolto dalla presente invenzione è quindi quello di fornire un apparato per il trattamento di una sostanza campione per la realizzazione di un contatto diretto tra tale sostanza campione e una superficie di contatto in condizioni controllate, per l’analisi di un’interazione per contatto tra la sostanza e la superficie. L’apparecchiatura comprende: (a) una camera di reazione adatta a contenere la sostanza campione; (b) la camera di reazione è delimitata da un ambiente di reazione interno e avente almeno un’apertura; (c) l’apertura comprende una predisposizione per la realizzazione di una connessione sigillata tra la camera di reazione e la superficie di contatto; (d) sensori per la misura dei relativi parametri dell’atmosfera di reazione; (e) strumenti per consentire la regolazione dei parametri dell’atmosfera di reazione. Nell’intero settore delle tecnologie alimentari e della produzione alimentare. L’apparecchio consente di eseguire i test di migrazione sul posto, direttamente sulla superficie del macchinario della linea di produzione; il test viene eseguito in condizioni controllate secondo i parametri di legge; l’apparecchio è mobile, quindi può essere trasportato facilmente in qualsiasi punto della linea di produzione.
1. ENGLISH: Mobile system for food safety, apparatus for treating a sample substance, in particular a simulant of a food substance, to enable direct contact between said sample substance and a contact surface under controlled conditions.
All materials and articles intended to come into contact with food are designated by the acronym MOCA (Materials and Objects in Contact with Food), while the relevant sector is referred to by the English term food packaging. The technical problem addressed and solved by the present invention is thus to provide an apparatus for treating a sample substance to enable direct contact between said sample and a contact surface under controlled conditions, for the purpose of analyzing the interaction by contact between the substance and the surface. The equipment comprises:
(a) a reaction chamber suitable for containing the sample substance; (b) the reaction chamber is enclosed within a reaction environment and has at least one opening; (c) the opening includes a configuration for establishing a sealed connection between the reaction chamber and the contact surface; (d) sensors for measuring the relevant parameters of the reaction atmosphere; (e) instruments to allow the adjustment of the parameters of the reaction atmosphere. Applicable across the entire field of food technology and food production. The apparatus allows for on-site migration tests, directly on the surface of machinery within the production line; the test is performed under controlled conditions in accordance with legal parameters; the apparatus is mobile, and can therefore be easily transported to any point of the production line.
2. ITALIAN: sistema di rilevamento di deviazioni di forma 3D in linea di produzione controlli di qualità. Sistema utilizzabile in linea di produzione per verifiche real-time delle deviazioni e scalabile per la verifica di oggetti di qualunque dimensione. Il sistema rileva le deviazioni della forma di un oggetto tridimensionale rispetto alle specifiche di progetto grazie all’utilizzo di tecniche di proiezione di frange e di phase-shift. Il sistema è utilizzabile con oggetti di qualunque dimensione, ma si presta particolarmente ad oggetti di grandi dimensioni, da 0.5 m × 0.5 m a 2 m × 2 m, e oltre. Il sistema è composto da un telaio o cella (1), all’interno del quale viene posizionato il pezzo in esame (3) su di un pianale (2). il pezzo è inquadrato da un certo numero di telecamere (4) e contemporaneamente illuminato da 4 proiettori (5). Tutti i sensori, in posizione fissa, inquadrano tutto il pezzo, minimizzando o annullando le eventuali zone d’ombra o sottosquadri altrimenti non acquisibili. L’acquisizione delle immagini, la generazione del pattern dei proiettori e l’elaborazione avviene tramite computer (6). Caratteristica peculiare del sistema è la possibilità di acquisire e processare fino a 10 pezzi al minuto su linea di produzione. misure ottiche; industria automotiva; piccoli elettrodomestici; grandi elettrodomestici; aerospaziale; controllo qualità in linea. acquisizione della forma 3D in tempo reale su una linea da 10 pezzi al minuto; correzione automatica ed in real-time dell’errore di posizionamento del pezzo sulla linea di produzione; scalabilità della precisione di calibrazione.
2. ENGLISH: System for detecting 3D shape deviations in a production line for quality control. The system can be used directly on the production line for real-time verification of deviations and is scalable to inspect objects of any size. It detects deviations in the shape of a three-dimensional object from design specifications by using fringe projection and phase-shift techniques. The system is suitable for objects of any size but is particularly effective for large items, ranging from 0.5 m × 0.5 m to 2 m × 2 m and beyond. It consists of a frame or inspection cell within which the object under examination is placed on a platform. The object is captured by multiple cameras and simultaneously illuminated by four projectors. All sensors are fixed in position and cover the entire object, minimizing or eliminating shadowed areas or undercuts that would otherwise not be acquired. Image acquisition, projector pattern generation, and data processing are managed by a computer. A key feature of the system is its ability to acquire and process up to ten pieces per minute directly on the production line. It is suitable for applications involving optical measurements and is used in the automotive, small and large appliance, and aerospace industries, ensuring in-line quality control. The system enables real-time 3D shape acquisition at a rate of ten parts per minute, allows automatic and real-time correction of positioning errors of the object on the production line, and supports scalable calibration accuracy.

References

  1. Council Recommendation on the Guiding Principles for Knowledge Valorisation, 2022 Interinstitutional File: 2022/0233(NLE). Available online: https://eur-lex.europa.eu/eli/reco/2022/2415/oj/eng (accessed on 15 May 2023).
  2. EU Valorization Policy. 2020. Available online: https://research-and-innovation.ec.europa.eu/system/files/2020-03/ec_rtd_valorisation_factsheet.pdf (accessed on 16 May 2023).
  3. Trappey, A.; Trappey, C.V.; Hsieh, A. An intelligent patent recommender adopting machine learning approach for natural language processing: A case study for smart machinery technology mining. Technol. Forecast. Soc. Change 2021, 164, 120511.
  4. Lybbert, T.J.; Zolas, N.J. Getting patents and economic data to speak to each other: An ‘algorithmic links with probabilities’ approach for joint analyses of patenting and economic activity. Res. Policy 2014, 43, 530–542.
  5. Asche, G. “80% of technical information found only in patents”—Is there proof of this? World Pat. Inf. 2017, 48, 16–28.
  6. Giordano, V.; Chiarello, F.; Melluso, N.; Fantoni, G.; Bonaccorsi, A. Text and dynamic network analysis for measuring technological convergence: A case study on defense patent data. IEEE Trans. Eng. Manag. 2021, 70, 1490–1503.
  7. Puccetti, G.; Giordano, V.; Spada, I.; Chiarello, F.; Fantoni, G. Technology identification from patent texts: A novel named entity recognition method. Technol. Forecast. Soc. Change 2023, 186, 122160.
  8. Daim, T.U.; Rueda, G.; Martin, H.; Gerdsri, P. Forecasting emerging technologies: Use of bibliometrics and patent analysis. Technol. Forecast. Soc. Change 2006, 73, 981–1012.
  9. Huang, Y.; Zhu, F.; Porter, A.L.; Zhang, Y.; Zhu, D.; Guo, Y. Exploring technology evolution pathways to facilitate technology management: From a technology life cycle perspective. IEEE Trans. Eng. Manag. 2020, 68, 1347–1359.
  10. Altuntas, S.; Dereli, T.; Kusiak, A. Forecasting technology success based on patent data. Technol. Forecast. Soc. Change 2015, 96, 202–214.
  11. Kim, G.; Bae, J. A novel approach to forecast promising technology through patent analysis. Technol. Forecast. Soc. Change 2017, 117, 228–237.
  12. Kyebambe, M.N.; Cheng, G.; Huang, Y.; He, C.; Zhang, Z. Forecasting emerging technologies: A supervised learning approach through patent analysis. Technol. Forecast. Soc. Change 2017, 125, 236–244.
  13. Karvonen, M.; Kässi, T. Patent citations as a tool for analysing the early stages of convergence. Technol. Forecast. Soc. Change 2013, 80, 1094–1107.
  14. Abbas, A.; Zhang, L.; Khan, S.U. A literature review on the state-of-the-art in patent analysis. World Pat. Inf. 2014, 37, 3–13.
  15. Ernst, H. Patent information for strategic technology management. World Pat. Inf. 2003, 25, 233–242.
  16. Thorleuchter, D.; Van den Poel, D.; Prinzie, A. A compared R&D-based and patent-based cross impact analysis for identifying relationships between technologies. Technol. Forecast. Soc. Change 2010, 77, 1037–1050.
  17. Aristodemou, L.; Tietze, F. The state-of-the-art on Intellectual Property Analytics (IPA): A literature review on artificial intelligence, machine learning and deep learning methods for analysing intellectual property (IP) data. World Pat. Inf. 2018, 55, 37–51.
  18. Tseng, F.M.; Hsieh, C.H.; Peng, Y.N.; Chu, Y.W. Using patent data to analyze trends and the technological strategies of the amorphous silicon thin-film solar cell industry. Technol. Forecast. Soc. Change 2011, 78, 332–345.
  19. Trappey, A.J.; Chen, P.P.; Trappey, C.V.; Ma, L. A machine learning approach for solar power technology review and patent evolution analysis. Appl. Sci. 2019, 9, 1478.
  20. Choi, Y.; Park, S.; Lee, S. Identifying emerging technologies to envision a future innovation ecosystem: A machine learning approach to patent data. Scientometrics 2021, 126, 5431–5476.
  21. Yu, X.; Zhang, B. Obtaining advantages from technology revolution: A patent roadmap for competition analysis and strategy planning. Technol. Forecast. Soc. Change 2019, 145, 273–283.
  22. Georgiou, K.; Mittas, N.; Ampatzoglou, A.; Chatzigeorgiou, A.; Angelis, L. Data-Oriented Software Development: The Industrial Landscape through Patent Analysis. Information 2023, 14, 4.
  23. EPO. India and Europe Explore the Impact of Industry 4.0 on the Patent System; Technical Report; European Patent Office: Munich, Germany, 2016.
  24. Gubbi, J.; Buyya, R.; Marusic, S.; Palaniswami, M. Internet of Things (IoT): A vision, architectural elements, and future directions. Future Gener. Comput. Syst. 2013, 9, 1645–1660. [Google Scholar] [CrossRef]
  25. Günther, W.A.; Mehrizi, M.H.R.; Huysman, M.; Feldberg, F. Debating big data: A literature review on realizing value from big data. J. Strateg. Inf. Syst. 2017, 26, 191–209. [Google Scholar] [CrossRef]
  26. Intellectual Property Action Plan Implementation. 2022. Available online: https://single-market-economy.ec.europa.eu/industry/strategy/intellectual-property/intellectual-property-action-plan-implementation_en (accessed on 16 May 2023).
  27. Bellantuono, L.; Monaco, A.; Amoroso, N.; Aquaro, V.; Bardoscia, M.; Loiotile, A.D.; Lombardi, A.; Tangaro, S.; Bellotti, R. Territorial bias in university rankings: A complex network approach. Sci. Rep. 2022, 12, 4995. [Google Scholar] [CrossRef]
  28. Demarinis Loiotile, A.; De Nicolò, F.; Agrimi, A.; Bellantuono, L.; La Rocca, M.; Monaco, A.; Pantaleo, E.; Tangaro, S.; Amoroso, N.; Bellotti, R. Best Practices in Knowledge Transfer: Insights from Top Universities. Sustainability 2022, 14, 15427.
  29. Hu, T.; Zhang, Y. A spatial–temporal network analysis of patent transfers from US universities to firms. Scientometrics 2021, 126, 27–54.
  30. Deng, W.; Ma, J. A knowledge graph approach for recommending patents to companies. Electron. Commer. Res. 2021, 22, 1435–1466.
  31. Chen, H.; Deng, W. Interpretable patent recommendation with knowledge graph and deep learning. Sci. Rep. 2023, 13, 2586.
  32. Lee, J.S.; Park, J.H.; Bae, Z.T. The effects of licensing-in on innovative performance in different technological regimes. Res. Policy 2017, 46, 485–496.
  33. Wang, Y.; Ning, L.; Chen, J. Product diversification through licensing: Empirical evidence from Chinese firms. Eur. Manag. J. 2014, 32, 577–586.
  34. McDevitt, V.L.; Mendez-Hinds, J.; Winwood, D.; Nijhawan, V.; Sherer, T.; Ritter, J.F.; Sanberg, P.R. More than money: The exponential impact of academic technology transfer. Technol. Innov. 2014, 16, 75–84.
  35. Roessner, D.; Bond, J.; Okubo, S.; Planting, M. The economic impact of licensed commercialized inventions originating in university research. Res. Policy 2013, 42, 23–34.
  36. ASTP Survey Report. 2019. Available online: https://www.astp4kt.eu//assets/documents/Report%20-%20ASTP%20Survey%20on%20KT%20Activities%20FY2019.pdf (accessed on 16 December 2022).
  37. Chen, J.; Chen, J.; Zhao, S.; Zhang, Y.; Tang, J. Exploiting word embedding for heterogeneous topic model towards patent recommendation. Scientometrics 2020, 125, 2091–2108.
  38. Du, W.; Wang, Y.; Xu, W.; Ma, J. A personalized recommendation system for high-quality patent trading by leveraging hybrid patent analysis. Scientometrics 2021, 126, 9369–9391.
  39. Haghighian Roudsari, A.; Afshar, J.; Lee, W.; Lee, S. PatentNet: Multi-label classification of patent documents using deep learning based language understanding. Scientometrics 2022, 127, 207–231.
  40. Krestel, R.; Chikkamath, R.; Hewel, C.; Risch, J. A survey on deep learning for patent analysis. World Pat. Inf. 2021, 65, 102035.
  41. Yun, J.; Geum, Y. Automated classification of patents: A topic modeling approach. Comput. Ind. Eng. 2020, 147, 106636.
  42. Souza, C.M.; Meireles, M.R.; Almeida, P.E. A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset. Scientometrics 2021, 126, 135–156.
  43. Gomez, J.C.; Moens, M.F. A survey of automated hierarchical classification of patents. In Professional Search in the Modern World: COST Action IC1002 on Multilingual and Multifaceted Interactive Information Access; Springer: Cham, Switzerland, 2014; pp. 215–249.
  44. Technology Transfer System Handbook. Available online: https://www.polito.it/imprese/trasferimento/TTS_handbook.pdf (accessed on 22 December 2022).
  45. Chowdhary, K.R. Fundamentals of Artificial Intelligence; Springer: New Delhi, India, 2020; pp. 603–649.
  46. Schaffer, C. Selecting a classification method by cross-validation. Mach. Learn. 1993, 13, 135–143.
  47. Wu, X.Z.; Zhou, Z.H. A unified view of multi-label performance measures. In Proceedings of the International Conference on Machine Learning (PMLR), Sydney, NSW, Australia, 6–11 August 2017; pp. 3780–3788.
  48. Ebert, S.; Müller, T.; Schütze, H. Lamb: A good shepherd of morphologically rich languages. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 742–752.
  49. Camacho-Collados, J.; Pilehvar, M.T. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res. 2018, 63, 743–788.
  50. Gerlach, M.; Shi, H.; Amaral, L.A.N. A universal information theoretic approach to the identification of stopwords. Nat. Mach. Intell. 2019, 1, 606–612.
  51. Sarica, S.; Luo, J. Stopwords in technical language processing. PLoS ONE 2021, 16, e0254937.
  52. Jivani, A.G. A comparative study of stemming algorithms. Int. J. Comp. Tech. Appl. 2011, 2, 1930–1938.
  53. Hassani, K.; Lee, W.S. Visualizing natural language descriptions: A survey. ACM Comput. Surv. 2016, 49, 1–34.
  54. Baeza-Yates, R.; Ribeiro-Neto, B. Modern Information Retrieval; ACM Press: New York, NY, USA, 1999; Volume 463.
  55. Aizawa, A. An information-theoretic perspective of tf–idf measures. Inf. Process. Manag. 2003, 39, 45–65.
  56. Li, S.; Hu, J.; Cui, Y.; Hu, J. DeepPatent: Patent classification with convolutional neural networks and word embedding. Scientometrics 2018, 117, 721–744.
  57. Jung, G.; Shin, J.; Lee, S. Impact of preprocessing and word embedding on extreme multi-label patent classification tasks. Appl. Intell. 2023, 53, 4047–4062.
  58. Hu, J.; Li, S.; Hu, J.; Yang, G. A hierarchical feature extraction model for multi-label mechanical patent classification. Sustainability 2018, 10, 219.
  59. Zuva, K.; Zuva, T. Evaluation of information retrieval systems. Int. J. Comput. Sci. Inf. Technol. 2012, 4, 35.
  60. Lee, J.S.; Hsiang, J. Patent classification by fine-tuning BERT language model. World Pat. Inf. 2020, 61, 101965.
  61. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 398.
  62. Shi, T.; Horvath, S. Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 2006, 15, 118–138.
  63. Rameshbhai, C.J.; Paulose, J. Opinion mining on newspaper headlines using SVM and NLP. Int. J. Electr. Comput. Eng. (IJECE) 2019, 9, 2152–2163.
  64. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  65. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.
  66. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009; Volume 2, pp. 1–758.
  67. Hilbe, J.M. Logistic Regression Models; CRC Press: Boca Raton, FL, USA, 2009.
  68. Chauhan, V.K.; Dahiya, K.; Sharma, A. Problem formulations and solvers in linear SVM: A review. Artif. Intell. Rev. 2019, 52, 803–855.
  69. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
  70. Guo, X.; Yin, Y.; Dong, C.; Yang, G.; Zhou, G. On the class imbalance problem. In Proceedings of the 2008 Fourth International Conference on Natural Computation, Jinan, China, 18–20 October 2008; IEEE: Piscataway, NJ, USA, 2008; Volume 4, pp. 192–201.
  71. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  72. Blagus, R.; Lusa, L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 2013, 14, 106.
  73. Fall, C.J.; Törcsvári, A.; Benzineb, K.; Karetka, G. Automated categorization in the international patent classification. In ACM SIGIR Forum; ACM: New York, NY, USA, 2003; Volume 37, pp. 10–25.
  74. Hepburn, J. Universal language model fine-tuning for patent classification. In Proceedings of the Australasian Language Technology Association Workshop, Dunedin, New Zealand, 10–12 December 2018; pp. 93–96.
  75. Verberne, S.; D’hondt, E.K.L.; Oostdijk, N.H.J.; Koster, C.H. Quantifying the challenges in parsing patent claims. In Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval at ECIR 2010, Milton Keynes, UK, 28–31 March 2010.
  76. Loiotile, A.D.; De Nicolò, F.; Monaco, A.; Tangaro, S.; Loccisano, S.; Conti, G.; Agrimi, A.; Amoroso, N.; Bellotti, R. Innovations and Emerging Technologies: A Study of the Italian Intellectual Property Knowledge Database. In Proceedings of the ICAART, Lisbon, Portugal, 22–24 February 2023.
Figure 1. Snapshot of the KS platform showing its main features, such as the possibility to query patents by keywords or by research centers.
Figure 2. KS percentage distribution of patents across the different technological domains.
Figure 3. Overall flowchart of the proposed methodology. TF-IDF (Term Frequency–Inverse Document Frequency) was used for language analysis; several machine learning algorithms, such as Logistic Regression, Random Forest, and Support Vector Machine, were then used for patent classification.
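As a minimal sketch of the language-analysis step in Figure 3, the snippet below computes TF-IDF weights for a toy corpus in plain Python. The toy documents, the whitespace tokenizer, and the unsmoothed weighting (term frequency times ln(N/df)) are illustrative assumptions; the TF-IDF variant and the preprocessing actually used in the workflow may differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = raw count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy abstracts, not taken from the KS database.
toy_docs = [
    "solar cell energy storage",
    "biodegradable food packaging film",
    "solar panel packaging",
]
toy_weights = tfidf(toy_docs)
```

In this toy corpus, "cell" occurs in a single document and therefore receives a larger weight than "solar", which occurs in two. The resulting weight vectors are the kind of input a classifier such as Logistic Regression would be trained on.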
Figure 4. The 5-fold cross-validation framework. Schematic overview of a single repetition of 5-fold cross-validation. At every iteration, four folds, the training folds (in grey), are used to train a model, and the remaining one, the test fold (in blue), is used to validate it. Hence, after a cross-validation round, all the available instances have been predicted. The overall performance of the model is obtained by averaging its performance over all the iterations. To obtain more robust results, we repeat this framework 100 times.
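The repeated cross-validation scheme described in the Figure 4 caption can be sketched as follows. This is an illustrative index generator in plain Python, assuming simple random shuffling without stratification; the function names, the seed, and the splitting strategy are assumptions rather than the authors' implementation.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=100, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition at each repetition
        # Fold i takes every n_folds-th index, so fold sizes differ by at most one.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test

def mean_score(scores):
    """Average the per-fold scores collected over all repetitions."""
    return sum(scores) / len(scores)

# Small example: 23 samples, 5 folds, 2 repetitions give 10 train/test splits.
splits = list(repeated_kfold(n_samples=23, n_folds=5, n_repeats=2))
```

Within one repetition every instance lands in exactly one test fold, which is why all instances are predicted once per round. Collecting one score per split and passing the list to `mean_score` gives the averaged performance.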
Figure 5. LR sorted importance for the first 1000 words. Notably, the trend shows a steep decrease followed by a large plateau after 260 words.
Figure 6. The partition of the 260 words into the 7 technological areas.
Table 1. Mean $\overline{P@1}$, $\overline{R@1}$ and $\overline{F1@1}$ values for the classification algorithms. The corresponding standard deviations are reported in brackets. $\overline{F1@1}$ values are in bold.

| Metric | RC | LR | RF | SVM | XGB |
| --- | --- | --- | --- | --- | --- |
| $\overline{P@1}$ | 0.181 (0.022) | 0.801 (0.024) | 0.755 (0.026) | 0.793 (0.023) | 0.764 (0.023) |
| $\overline{R@1}$ | 0.139 (0.017) | 0.679 (0.022) | 0.645 (0.024) | 0.674 (0.022) | 0.648 (0.021) |
| $\overline{F1@1}$ | **0.157 (0.019)** | **0.735 (0.022)** | **0.695 (0.024)** | **0.728 (0.022)** | **0.701 (0.021)** |
Table 2. Mean $\overline{P@2}$, $\overline{R@2}$ and $\overline{F1@2}$ values for the classification algorithms. The corresponding standard deviations are reported in brackets. $\overline{F1@2}$ values are in bold.

| Metric | RC | LR | RF | SVM | XGB |
| --- | --- | --- | --- | --- | --- |
| $\overline{P@2}$ | 0.184 (0.013) | 0.544 (0.013) | 0.503 (0.016) | 0.518 (0.016) | 0.508 (0.016) |
| $\overline{R@2}$ | 0.286 (0.022) | 0.867 (0.016) | 0.813 (0.022) | 0.830 (0.018) | 0.818 (0.021) |
| $\overline{F1@2}$ | **0.224 (0.019)** | **0.669 (0.013)** | **0.622 (0.018)** | **0.637 (0.016)** | **0.627 (0.017)** |
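To make the metrics in Tables 1 and 2 concrete, the sketch below computes P@k, R@k and F1@k for a single document, assuming the usual definitions: the number of correct labels among the k top-ranked predictions, divided by k for precision and by the number of true labels for recall, with F1 as their harmonic mean. The scores and labels are hypothetical, and the function name is ours.

```python
def precision_recall_f1_at_k(scores, true_labels, k):
    """Compute P@k, R@k and F1@k for one document.

    scores: {label: classifier score}; true_labels: set of correct labels.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = len(set(ranked[:k]) & set(true_labels))
    precision = hits / k
    recall = hits / len(true_labels)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical scores for one patent labelled with two domains.
scores = {"Biomed": 0.61, "Electronics": 0.22, "Basic Science": 0.17}
truth = {"Biomed", "Basic Science"}
p1, r1, f1_1 = precision_recall_f1_at_k(scores, truth, k=1)  # 1 hit out of 1 prediction
p2, r2, f1_2 = precision_recall_f1_at_k(scores, truth, k=2)  # 1 hit out of 2 predictions
```

Averaging such per-document values over test folds and repetitions would give means of the kind reported in the tables. As a consistency check on the reported numbers, the harmonic mean of the LR means in Table 1, 2 · 0.801 · 0.679 / (0.801 + 0.679) ≈ 0.735, agrees with the reported F1@1.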
Table 3. Mean frequencies of confusion among categories (in percentage). Rows are true labels and columns are predicted labels. The frequencies of correct labelling, in bold, are reported in the main diagonal. The corresponding standard deviations are in brackets.

| True label \ Predicted label | Agrifood | Environment | Basic Science | Green Energy | Electronics | Packaging | Biomed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Agrifood | **68.1 (9.1)** | 6.2 (3.3) | 7.2 (2.1) | 2.1 (1.0) | 5.1 (2.2) | 1.1 (0.4) | 10.2 (4.1) |
| Environment | 4.5 (3.2) | **42.1 (8.2)** | 19.3 (6.2) | 13.4 (4.4) | 13.2 (4.1) | 6.2 (2.2) | 3.1 (1.0) |
| Basic Science | 6.2 (2.1) | 6.7 (2.2) | **51.1 (5.1)** | 6.2 (3.1) | 7.6 (2.1) | 3.1 (1.0) | 20.4 (4.1) |
| Green Energy | 3.8 (1.1) | 11.2 (2.3) | 11.2 (3.2) | **56.9 (8.2)** | 13.2 (5.3) | 2.1 (0.5) | 3.4 (1.1) |
| Electronics | 1.8 (0.2) | 4.5 (0.8) | 6.5 (1.9) | 6.8 (1.8) | **64.3 (5.2)** | 5.6 (1.2) | 14.5 (2.2) |
| Packaging | 8.7 (1.3) | 8.2 (1.2) | 23.8 (3.2) | 5.2 (1.4) | 23.9 (8.4) | **24.8 (2.1)** | 10.8 (2.1) |
| Biomed | 1.1 (0.1) | 1.8 (0.2) | 8.8 (1.1) | 1.6 (0.1) | 9.7 (1.1) | 2.8 (0.7) | **77.8 (2.3)** |
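A table of this kind can be obtained by normalising each row of a confusion matrix by the number of items carrying that true label. The sketch below illustrates the computation for a single cross-validation round, assuming one true and one predicted label per item; how multi-label patents were counted in Table 3 is not specified here, and because the reported values are means over repetitions, its rows need not sum exactly to 100. The data are hypothetical.

```python
def confusion_percentages(true_labels, predicted_labels, categories):
    """Row-normalised confusion matrix: entry [t][p] is the percentage of
    items with true label t that were predicted as p."""
    counts = {t: {p: 0 for p in categories} for t in categories}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    table = {}
    for t in categories:
        total = sum(counts[t].values())
        table[t] = {
            p: (100.0 * counts[t][p] / total if total else 0.0)
            for p in categories
        }
    return table

# Hypothetical predictions, for illustration only.
cats = ["Packaging", "Electronics", "Basic Science"]
y_true = ["Packaging", "Packaging", "Packaging", "Packaging", "Electronics", "Basic Science"]
y_pred = ["Packaging", "Electronics", "Basic Science", "Electronics", "Electronics", "Basic Science"]
conf = confusion_percentages(y_true, y_pred, cats)
```

In this toy example, half of the "Packaging" items are predicted as "Electronics", which mimics the kind of off-diagonal mass seen in the Packaging row of Table 3.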
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Amoroso, N.; Demarinis Loiotile, A.; Pantaleo, E.; Conti, G.; Loccisano, S.; Tangaro, S.; Monaco, A.; Bellotti, R. An Italian Patent Multi-Label Classification System to Support the Innovation Demand and Supply Matching. Sustainability 2025, 17, 6425. https://doi.org/10.3390/su17146425
