Technology Clusters Exploration for Patent Portfolio through Patent Abstract Analysis

This study explores technology clusters through patent analysis. The aim of exploring technology clusters is to grasp competitors' levels of sustainable research and development (R&D) and to establish a sustainable strategy for entering an industry. To achieve this, we first grouped patent documents with similar technologies by applying affinity propagation (AP) clustering, which is effective for grouping large amounts of data. Next, in order to define the technology clusters, we adopted the term frequency-inverse document frequency (TF-IDF) weight, which ranks terms in order of importance. To verify the proposed methodology, we collected the patent data of Korean electric car companies from the United States Patent and Trademark Office (USPTO). The results show that the proposed methodology presents more detailed information on the Korean electric car industry than previous studies.


Introduction
In recent years, companies and countries have been responding to complicated and rapid changes in technology and are hence focusing on establishing structured research and development systems to strengthen their capacity for future technology forecasts. In addition, companies and countries are required to continuously invest in technology development to support their sustainable competitiveness [1]. Thus, entering an industry and establishing sustainable business strategies, based on a structured strategy of forecasting promising technology areas and monitoring competitors, is an important issue. In the past, qualitative analysis techniques, such as the Delphi technique, were the main method of technology forecasting and facilitated decisions in establishing a product's improvement or the development plan for new products, as well as grasping technology alternatives and the appropriate time to introduce new technology [2,3]. However, such a method has its disadvantages. First, there is the possibility of ignoring potential relationships between the forecasted events and of being influenced by the subjective opinions of the experts. Second, it is costly and time consuming.
To compensate for these disadvantages, the monitoring of competitors and the study of technology forecasting are extensively used. These studies are performed through patent portfolio development by applying a variety of statistical and data mining techniques to patents, which contain a detailed explanation of technology. Patent documents include potential information on developed technologies, such as a summary of the technology and information on inventors and applicants. Such patent information, with its explanation of technology and innovative activity, is a very important source in the areas of technology development strategy and market transaction [4]. Therefore, utilizing patent information appropriately is crucial to technology development as well as business management. Exploring technology clusters, which are formed by grouping patent documents according to their technological similarity, is an especially important step in discovering new businesses and understanding the competition in the areas where the business operates. Furthermore, this step contributes to identifying and entering new and inadequately explored technology fields in order to diversify the business.
This research explores the appearance of new areas of technology, derived from technology fusion, in the patent information of companies in specific technology areas. It also studies which technology areas a business should focus on in order to survive in a continuously growing industry. The research is separated into two parts: the formation of technology clusters through the analysis of patent documents, and the extraction of core keywords to define the formed technology clusters. By doing so, we suggest a sustainable strategy for taking the lead in entering a new technology industry and discovering new business opportunities. To achieve this, this paper presents a methodology that forms technology clusters through affinity propagation (AP) clustering, which is based on message passing, wherein patent documents exchange similarities by message. AP is appropriate for patent data, which includes extensive text information, because it clusters the data by measuring similarities in a complicated and abundant amount of data. Furthermore, we extract the core keywords of the technology clusters through the term frequency-inverse document frequency (TF-IDF) weight in the process of defining the formed technology clusters. We explore the technology clusters of Korean companies in the electric car technology field by analyzing their patent applications with the United States Patent and Trademark Office (USPTO) [5]. From this analysis, we propose a portfolio that facilitates new technology development and a sustainable strategy to enter a new industry by forming technology clusters for each business in the area of electric car technology.

Patent Research and Analysis
Much patent research and analysis has been conducted to satisfy the various demands of patent analysis experts and company managers [6]. Patent documents include information about citations, applicants, inventors, and the International Patent Classification (IPC), as well as an abstract giving a detailed description of the invention. Researching and analyzing this information holds great potential for fully understanding the technology levels and innovation trends in countries around the world [7]. Within this information, citations can be used to identify the value or innovative activity of a company [8,9]. In addition to citations, there is applicant and inventor information, which shows relative technology leadership in certain areas [10]. Moreover, it is possible to visualize the influence of technology through network analysis by using IPC information [11]. The abstract in the patent document is especially important since it provides a detailed description of the technology. However, since the abstract is unstructured data that is hard to analyze directly, it is necessary to transform it into structured data using the text mining technique. The keywords thus extracted from the patent contain the core technology, product, components, and methodology of the technology [4]. In this study, we apply the text mining technique to analyze the abstract of the patent in order to explore the technology clusters.

Strategy for Exploring Technology Clusters
In order to establish sustainable research and development strategies, it is important to identify potential future technology and grasp the current level of technology development in a rival company by exploring technology clusters. The exploration of technology clusters using patent information and text mining techniques, which has been approached in various ways by many researchers, is generally executed in two stages.
The first stage is to form technology clusters from patent documents of similar technologies. Lee et al. [12] transformed unstructured patent data into structured data using the text mining technique. This was followed by principal component analysis (PCA), which was adopted to reduce the high dimensionality of patent data and to map the patent documents. Kim et al. [13], Jun et al. [14] and Jun [11] developed technology clusters using k-means clustering based on Euclidean distance, and Jun et al. [15] formed technology clusters using k-medoids clustering. However, there are some limitations: PCA is inefficient for patent mapping in terms of visualization; k-means clustering and k-medoids clustering require the number of clusters to be determined beforehand; and the results of k-means clustering are greatly influenced by the initial cluster centers.
The next stage is the interpretation of the formed technology clusters. For this stage, there is a method that uses the terms in the clusters and the IPC, which indicates the technology classification. The technologies are defined after determining the keywords that occur together most frequently in the formed clusters [14-16]. Jun and Park [11] also defined the representative technology of each cluster as its most frequent IPC. Kim et al. [13] suggested a patent map in the form of a semantic network based on the terms' frequency and filing data. However, these approaches have limitations, because a technology definition based on unfiltered terms or IPC frequency could contain noise and false or unspecific terms. Therefore, a more refined and objective approach to exploring technology clusters is necessary in order to overcome these limitations.

Formation of Technology Clusters
We consider the abstract of the patent document, which describes the technology in detail, for grouping the patent documents with similar technologies. However, as the abstract text is unstructured data, it needs to be transformed into structured data through the text mining technique. This technique includes a text normalization process that filters out data unnecessary for information retrieval [17]. The process includes removing blanks, punctuation marks, and stop-words such as articles, prepositions, and conjunctions, and changing capital letters to lower case. Following this, the normalized terms are transformed into a patent-term matrix (PTM), which consists of patent documents in rows and terms in columns. However, there is a problem of sparseness, as there are too many terms compared to the number of patent documents and most elements are 0. Therefore, multidimensional scaling (MDS) is conducted to overcome this sparseness and reduce the dimensionality.
MDS is a method of expressing data as points in two- or three-dimensional space after measuring the characteristics of the data and their similarity or dissimilarity [18]. Thus, the objective of MDS is to visually project the distances or similarities between data onto a low-dimensional space. The distance between the n data points is their measured dissimilarity. The detailed process is as follows. First, a distance matrix D(X), which is the n × n symmetric matrix of pair-wise dissimilarities, is extracted from the PTM, which has n documents and m terms. The pair-wise dissimilarities in D(X) are based on Euclidean distance. Given D(X), the n patent documents are compressed into a lower p-dimensional space, which is represented by an n × p configuration matrix. For this, the distance matrix is converted to the Gram matrix K, as follows:
$$K = -\frac{1}{2} H D^{(2)}(X) H$$
where $D^{(2)}(X)$ denotes the matrix of squared entries of D(X), $H = I - ee^{T}/n$, I is the identity matrix, and e is the vector of ones. K is decomposed as $K = V^{-1} \Lambda V$ using singular value decomposition (SVD), where V is the matrix of normalized eigenvectors. Then, $\lambda_p$, which is the pth biggest eigenvalue of matrix K, becomes an axis of the lower dimensional space. Hence, it is possible to obtain a patent-dimension configuration matrix (PDCM) in p-dimensional space, which is lowered by the ith component $v_i$ of the pth eigenvalue, as shown in Figure 1. In this way, patent documents are mapped into two- or three-dimensional space. Although three-dimensional space can include more information than two-dimensional space, it can lead to a misunderstanding of the internal features of technologies or to a vague relationship between technologies because of dimensional complexity [19]. In this study, to cluster patent documents with similar technologies, we map patent documents in two-dimensional space, in which technology relationships can be grasped clearly.
Next, we group the patent documents with similar technology content using AP clustering, which is efficient for clustering large data sets and does not require the initial values of each cluster to be selected. AP clustering is a technique based on message passing, which refers to exchanging messages that reflect the similarity of the data points [20]. Exemplars, which are representative data points, are chosen for each point by exchanging messages. Each cluster is then defined by its exemplar. Every data point is a candidate exemplar. The AP clustering process proceeds as shown in Figure 2.
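The text normalization, PTM construction, and MDS projection described in this section can be illustrated with a minimal sketch. It is not the implementation used in the study: the stop-word list, the function names `build_ptm` and `classical_mds`, and the three toy abstracts are hypothetical, and the projection uses the classical (eigendecomposition-based) form of MDS with squared Euclidean distances.

```python
import re
import numpy as np

# Hypothetical, deliberately tiny stop-word list; a real study would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "to", "is", "with"}

def build_ptm(abstracts):
    """Normalize abstracts and return (patent-term matrix, sorted vocabulary)."""
    docs = []
    for text in abstracts:
        tokens = re.findall(r"[a-z]+", text.lower())  # lower-case, drop punctuation
        docs.append([t for t in tokens if t not in STOP_WORDS])
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    ptm = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for t in doc:
            ptm[i, index[t]] += 1  # raw term counts
    return ptm, vocab

def classical_mds(ptm, p=2):
    """Project the rows of the PTM into p dimensions (classical MDS)."""
    n = ptm.shape[0]
    diff = ptm[:, None, :] - ptm[None, :, :]
    d2 = (diff ** 2).sum(axis=2)              # squared Euclidean distances
    h = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I - ee^T/n
    k = -0.5 * h @ d2 @ h                     # Gram matrix K
    eigvals, eigvecs = np.linalg.eigh(k)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:p]     # indices of the p largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)   # n x p configuration matrix (PDCM)

# Toy abstracts, invented for illustration only.
abstracts = [
    "A battery pack for an electric vehicle with a cooling plate.",
    "Cooling system of a battery pack in an electric vehicle.",
    "Electric power steering apparatus with a torque sensor.",
]
ptm, vocab = build_ptm(abstracts)
pdcm = classical_mds(ptm, p=2)
```

In this toy case the two battery-related abstracts land close together in the two-dimensional map, while the steering abstract lies farther away, which is the property the subsequent clustering step relies on.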
First, the similarity between data points i and k is extracted as an input value. The similarity s(i, k) is an index showing how appropriate data point k is as an exemplar of data point i, and Euclidean distance is used to measure it. Instead of requiring the number of clusters to be chosen beforehand, AP takes for each data point k a value s(k, k) that expresses its tendency to be chosen as an exemplar. s(k, k) is called the preference, and the number of exemplars is influenced by the preferences. AP operates by treating each data point as a network node and repeatedly exchanging messages from node to node. There are two kinds of messages: responsibility and availability. Responsibility r(i, k) is a message sent from data point i to k, which indicates how suitable data point k is as a candidate exemplar for data point i. Availability a(i, k) is a message sent from data point k to i, which indicates how likely data point k is to be selected as the exemplar of data point i. Both messages are updated with a damping factor, lam. When responsibilities and availabilities are updated, oscillations can arise that cause overshooting. To solve this problem, the damping factor weights the old message when the new message is computed, so that the oscillation is suppressed. This process is repeated until the message changes fall below a threshold or a prearranged repeat count is reached. As a result, the value of k that maximizes the sum of availability and responsibility is the exemplar of data point i.
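The message-passing procedure described above can be sketched as follows. The update rules in the comments are the standard ones from the affinity propagation literature [20] and are assumed, not confirmed, to match the rules used in this study. The function name `affinity_propagation`, the toy points, and the use of negative squared Euclidean distance as similarity are also assumptions made for illustration. The preference is set to the median similarity and lam to 0.9.

```python
import numpy as np

def affinity_propagation(s, lam=0.9, max_iter=1000, tol=1e-6):
    """Return the exemplar index of every point, given an n x n similarity
    matrix s whose diagonal s[k, k] holds the preferences."""
    n = s.shape[0]
    rows = np.arange(n)
    r = np.zeros((n, n))  # responsibilities r(i, k)
    a = np.zeros((n, n))  # availabilities a(i, k)
    for _ in range(max_iter):
        r_old, a_old = r, a
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        total = a + s
        best = total.argmax(axis=1)
        first = total[rows, best]
        total[rows, best] = -np.inf
        second = total.max(axis=1)
        r_new = s - first[:, None]
        r_new[rows, best] = s[rows, best] - second
        r = lam * r_old + (1 - lam) * r_new  # damped update

        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) <- sum_{i' != k} max(0, r(i',k))
        rp = np.maximum(r, 0)
        rp[rows, rows] = r[rows, rows]       # r(k,k) enters unclipped
        a_new = rp.sum(axis=0)[None, :] - rp
        diag = a_new[rows, rows].copy()
        a_new = np.minimum(a_new, 0)
        a_new[rows, rows] = diag             # self-availability is not capped at 0
        a = lam * a_old + (1 - lam) * a_new  # damped update

        # Stop once the messages change by less than the threshold.
        if max(np.abs(r - r_old).max(), np.abs(a - a_old).max()) < tol:
            break
    return (a + r).argmax(axis=1)            # k maximizing a(i,k) + r(i,k)

# Hypothetical two-dimensional coordinates, e.g. the output of an MDS step.
points = np.array([[0.0, 0.0], [0.2, 1.0], [1.1, 0.1], [0.5, 0.4],
                   [10.0, 10.0], [10.3, 11.0], [11.2, 10.1]])
diff = points[:, None, :] - points[None, :, :]
s = -(diff ** 2).sum(axis=2)                 # similarity: negative squared distance
off_diag = s[~np.eye(len(points), dtype=bool)]
np.fill_diagonal(s, np.median(off_diag))     # preference: median similarity
labels = affinity_propagation(s, lam=0.9)
```

Points that share a value in `labels` belong to the same cluster, and that value is the index of the cluster's exemplar. Lowering the preference on the diagonal of `s` tends to produce fewer clusters, and raising it tends to produce more.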

Definition of Technology Clusters
In order to define the technology of each technology cluster, we apply the TF-IDF weight to extract core keywords from the PTM of the technology cluster. The TF-IDF weight is a numerical statistic, used in the fields of information retrieval and text mining, that ranks terms in order of importance [21]. Term frequency (TF) indicates the number of times each term appears in each document. Inverse document frequency (IDF) is a numerical value that identifies distinctive terms by measuring in how many documents a certain term occurs, as follows:
$$\mathrm{IDF}_m = \log \frac{N}{n_m}$$
where N represents the total number of documents and $n_m$ indicates the number of documents that include the term m. As the IDF value rises, the term is considered more unique to those documents. The TF-IDF weight is then calculated by multiplying TF by IDF. Terms with a high TF-IDF weight are used as core keywords for defining the technology cluster. The overall proposed methodology is shown in Figure 3.
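As a minimal sketch of the keyword extraction step, the following computes TF-IDF weights for the documents of one cluster and ranks the terms. It assumes the common form IDF = log(N / n_m) and, since the text does not say how per-document weights are combined within a cluster, it simply sums them over the cluster's documents. The function name `tfidf_keywords` and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(cluster_docs, top_n=10):
    """Rank the terms of one technology cluster by summed TF-IDF weight.

    cluster_docs is a list of token lists, one per patent abstract.
    """
    n_docs = len(cluster_docs)                                    # N
    tf_per_doc = [Counter(doc) for doc in cluster_docs]           # TF per document
    doc_freq = Counter(term for tf in tf_per_doc for term in tf)  # n_m per term
    weights = Counter()
    for tf in tf_per_doc:
        for term, count in tf.items():
            # TF x IDF, accumulated over documents (an assumption, see above).
            weights[term] += count * math.log(n_docs / doc_freq[term])
    # Highest weight first; ties are broken alphabetically for reproducibility.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Toy, already-normalized abstracts (invented for illustration).
cluster_docs = [
    ["battery", "cooling", "battery", "vehicle"],
    ["battery", "pack", "vehicle"],
    ["steering", "torque", "vehicle"],
]
keywords = tfidf_keywords(cluster_docs, top_n=10)
```

Here "vehicle" occurs in every document, so its IDF, and therefore its weight, is zero. This is how the weighting filters out terms that are frequent but not distinctive.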
Sustainability 2016, 8, 1252

Case Study: Korean Electric Car Fields
To explore the technology clusters, we collected patent data on electric car technology held by Korean companies from the USPTO. We regarded a company holding more than ten patents in the United States as a leading company in electric vehicle technology and included it in the analysis.
In total, 655 patent documents were collected. Among these documents, Hyundai, Kia, LG, Mando, Samsung, LSIS, Halla, and SK held more than ten patents, as shown in Figure 4. The patent documents used cover the period from 2001 to 2013. We applied the text mining technique to the abstracts of the patent documents to form the technology clusters for each of the eight companies. The dimension of Hyundai's PTM was 273 × 2905, that is, 273 documents and 2905 terms. The other companies' dimensions were 62 × 1411 for Kia, 38 × 928 for LG, 31 × 798 for Mando, 20 × 538 for Samsung, 15 × 303 for LSIS, 13 × 386 for Halla, and 11 × 289 for SK. As shown in Figure 1a, the PTM of every company is sparse because it is high dimensional and is largely composed of 0. Therefore, we applied MDS in order to overcome this sparseness problem. To achieve this, dissimilarity based on Euclidean distance was calculated, as shown in Figure 1b, and then the high dimensionality of the terms was reduced to two dimensions, as shown in Figure 1c. Next, after mapping the patent documents in two-dimensional space, AP clustering was conducted to form the technology clusters. There are two important parameters in AP clustering: input preference and lam. Input preference is correlated with the number of clusters; the number is determined by the input preference. Generally, a successful choice for input preference is known to be the median of all the similarities among the data points [20]. Therefore, in this study, we adopted an input preference value according to this recommendation. lam reduces oscillations; the range of lam is [0, 1], and this value can be increased if AP cannot converge because of oscillations. Generally, a high value of lam leads to steady convergence [20]. In this study, lam was set to 0.9. Figure 5 shows the results of AP clustering after arranging each company's patent documents in two-dimensional space by MDS.
In Table 1, net similarity is a value that indicates how well the exemplars explain the nearby data. Hence, AP can be regarded as optimizing an objective function, in that it tends to maximize the net similarity. We obtained the optimum net similarity by applying input preferences over various ranges of values. As a result, the following technology clusters were formed for the companies: 19 clusters for Hyundai, 8 for Kia, 7 for LG, 5 for Mando, 5 for Samsung, 3 for LSIS, 3 for Halla, and 3 for SK. For the next step, it was necessary to define the technology of each of the formed technology clusters. As shown in Tables 2-9, the patent document sets of each technology cluster were transformed into a PTM. The technology definition was then inferred from the ten core keywords with the highest values, which were extracted by the TF-IDF weight. We asked experts in the electric car field to clearly define the technology through the extracted core keywords.
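To make the notion of net similarity concrete, the sketch below uses its usual definition in the affinity propagation literature [20]: the sum of each point's similarity to its exemplar, where an exemplar contributes its own preference s(k, k). Whether Table 1 was computed in exactly this way is an assumption. The function name `net_similarity`, the one-dimensional positions, and the preference value are hypothetical.

```python
def net_similarity(s, exemplar_of):
    """Sum of s(i, exemplar_of[i]) over all points i.

    For an exemplar k, exemplar_of[k] == k, so it contributes its preference s(k, k).
    """
    return sum(s[i][k] for i, k in enumerate(exemplar_of))

# Four toy points on a line; similarity is the negative squared distance,
# and every point gets the same (hypothetical) preference on the diagonal.
positions = [0.0, 1.0, 5.0, 6.0]
preference = -4.0
s = [[preference if i == k else -(x - y) ** 2
      for k, y in enumerate(positions)]
     for i, x in enumerate(positions)]

two_clusters = net_similarity(s, [0, 0, 2, 2])   # exemplars: points 0 and 2
one_cluster = net_similarity(s, [0, 0, 0, 0])    # single exemplar: point 0
all_exemplars = net_similarity(s, [0, 1, 2, 3])  # every point is its own exemplar
```

With this preference, the two-cluster assignment has the highest net similarity of the three candidates. A higher (less negative) preference would make additional exemplars cheaper and therefore favor more clusters.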

The clustering results show that, out of a total of 19 technology clusters, Hyundai has multi-zone air conditioning (13.92%), virtual driving system (7.33%), vehicle interface devices (7.33%), and air conditioning (7.33%) as high-rate technologies, and heat pump (1.10%) and electric compressor (1.83%) as less developed technologies. Kia has a total of eight technology clusters, in which eco-routing navigation, creep torque control system, and lean burn drive mode technologies are mainly developed, whereas heating system for a fuel cell vehicle (6.45%), control of a converter (8.06%), and heating and cooling system using a Peltier device (9.68%) are not well developed. Therefore, it seems that Hyundai and Kia are producing and developing the whole range of components of an electric car. LG's technology clusters comprise a battery monitoring device, parameter estimation of augmented state, and assembling the battery cell as high-rate technologies, whereas relay control apparatus, secondary battery, battery model, and driver circuit and diagnostic method are low-rate technologies. In the case of Mando, recovery control function, hydraulic power steering, electric power steering, column-type electric power steering, and electronic parking brake technologies are the main technology clusters. Samsung has technology clusters of integrated charging module device and lithium battery pack as high-rate technologies. For LSIS, the technology clusters consist of the charge management system (31.25%), electric motor control (37.5%), and supplying and blocking of the battery power (31.25%). Halla has technology clusters that include heating, ventilation, and air conditioning (HVAC) (45.15%), battery cooling (23.08%), and the air conditioning system for an electric car (30.77%). Finally, SK concentrates on battery development, DC/DC converter, and pre-charge resistance protector as its main research and development strategy.
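The percentages quoted for each cluster appear to be the cluster's share of the company's patent documents. As a hedged arithmetic check, the sketch below reproduces several of the Hyundai and Kia figures from the PTM sizes given earlier (273 and 62 documents). The per-cluster document counts are not stated in the text; they are back-calculated assumptions chosen to match the reported percentages, and the helper `share` is hypothetical.

```python
def share(cluster_size, total_patents):
    """Percentage of a company's patents that fall into one cluster."""
    return round(100 * cluster_size / total_patents, 2)

HYUNDAI_PATENTS = 273  # rows of Hyundai's PTM
KIA_PATENTS = 62       # rows of Kia's PTM

# Cluster sizes below are assumptions inferred from the reported percentages.
hyundai_multi_zone_ac = share(38, HYUNDAI_PATENTS)   # reported as 13.92%
hyundai_virtual_driving = share(20, HYUNDAI_PATENTS) # reported as 7.33%
hyundai_heat_pump = share(3, HYUNDAI_PATENTS)        # reported as 1.10%
hyundai_compressor = share(5, HYUNDAI_PATENTS)       # reported as 1.83%
kia_fuel_cell_heating = share(4, KIA_PATENTS)        # reported as 6.45%
kia_converter_control = share(5, KIA_PATENTS)        # reported as 8.06%
kia_peltier = share(6, KIA_PATENTS)                  # reported as 9.68%
```

Read this way, a low percentage corresponds to a cluster with only a handful of patents, which is the sense in which the text calls a technology "less developed".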
In order to verify the results of our analysis, we compared them with the electric car industry and technology trends report [22,23]. The report describes the technology trends and market situation of Korean electric cars and electric car components, such as motors and batteries. In addition, the report is used as a guideline for grasping the technology situation in the Korean electric car industry. For this reason, we used the report to verify the results of our analysis. According to the report, Hyundai and Kia focused in the past on producing the overall components of the electric car and the fuel cell electric car, but have recently begun to focus on the mass production of electric cars. Samsung, LG, and SK are the three big players in the battery market for the Korean electric car field; this matched our text mining results considerably. LSIS focused on the charging infrastructure sector and motor development, and Halla focused on research and development of air conditioning systems. These findings correspond to our analysis results. Mando develops the technology of regenerative brakes and chargers. Hence, we found that the results of our analysis closely matched the report; moreover, our results show the technology development status of each company in more detail. However, there were a few differences between the results of the patent analysis and the industry trends report, as patents usually take years to commercialize [24]. In addition, as shown in Table 10, the technology trends report presented the concepts of technologies that are mostly applied to final products, whereas the patents mainly specified the concepts of components and elemental technologies. This implies that the present study can provide important information for predicting which technologies will be commercialized and for understanding their details.

Conclusions
In this paper, we proposed a method for exploring technology clusters in order to develop an entry strategy by developing technology and monitoring competitors' technology levels in the Korean electric car industry. To achieve this, the study consisted of two parts. First, AP clustering was suggested for forming the technology clusters. In the process of forming the technology clusters, which group the patent documents with similar technologies, unstructured patent documents were transformed into structured data, and then the patent documents were mapped onto a low-dimensional space using MDS. Second, the TF-IDF weight was applied to extract core keywords in the process of defining the technology clusters. This study used AP clustering and TF-IDF specifically to overcome two problems of prior studies: the need to select the number of clusters in advance, and the definition of technology clusters based on the frequency of terms, including noise terms. With our case study, we were able to indicate the Korean companies' research situations and the direction of the electric car industry. This study is also expected to provide a patent portfolio that supports sustainable research and development for companies that are about to enter the electric car industry. By exploring the technology clusters in each company's patent information, we examined which technologies the companies have, and how these technologies are concentrated and distributed. This is expected to be helpful for improving competitive power and reducing the cost of sustainable research and development for many companies. In particular, compared to previous reports, the approach in this study made it possible to determine the technology trends in more detail and to predict the technologies that are likely to be commercialized in the future.
Despite the comprehensive and effective aspects of our proposed method, there are still some limitations. First, the number of clusters produced by AP clustering is sensitive to the parameter settings. In addition, the interpretation of the technology clusters from the extracted core keywords relied on qualitative judgment by experts, even though we removed unnecessary terms by using TF-IDF weights. Therefore, future studies should combine more diverse and effective techniques of data mining and quantitative analysis to develop a more advanced and objective patent portfolio.

Figure 1. Process of transforming the PTM into PDCM.

Figure 3. Overall process of proposed methodology. (a) Technology clusters formation; (b) Technology definition.

Figure 4. Collected patent documents of Korean electric car fields from USPTO.

Figure 5. Results of AP clustering of each company in two-dimensional space.

Table 1. Factors of AP clustering of each company's technology.

Table 2. Technology clusters for Hyundai.

Table 3. Technology clusters for Kia.

Table 4. Technology clusters for LG.

Table 5. Technology clusters for Mando.

Table 6. Technology clusters for Samsung.

Table 8. Technology clusters for Halla.

Table 9. Technology clusters for SK.

Table 10. Korean electric car companies' technology trends in the trends report.