4.1. Topic Discovery with LDA
This study applied LDA to identify topics from financial BM patents [80, 81]. For LDA, an empirical Bayes method was used for parameter estimation [20]. The topic model was fitted to the abstracts of financial BM patents using the statistical software R. Unstructured textual data contain abundant and valuable information on the technologies of financial BM patents. To reduce irrelevant terms in the text data, uninformative elements, such as conjunctions, determiners, prepositions, and numbers, were removed. Ninety-three high-frequency terms were also removed to eliminate words that are generic in financial BMs. A term-by-document frequency matrix was then constructed from the patent document collection as the input to LDA.
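The preprocessing steps above can be sketched in a few lines. This is a minimal Python illustration, not the R code used in the study. The stop-word set, the `GENERIC_TERMS` set (a stand-in for the 93 removed high-frequency terms, which are not listed here), and the two sample abstracts are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the study's actual stop-word list and its
# 93 removed high-frequency terms are not reproduced here.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "for", "to", "in", "on", "with", "is"}
GENERIC_TERMS = {"method", "system", "data"}


def tokenize(abstract):
    """Lower-case, keep alphabetic tokens only (drops numbers), remove filtered terms."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in GENERIC_TERMS]


def term_document_matrix(abstracts):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(a)) for a in abstracts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Made-up abstracts for illustration only.
abstracts = [
    "A method for processing 2 loan payments with a mobile terminal.",
    "System and method for the trading of derivatives and loan hedging.",
]
vocab, tdm = term_document_matrix(abstracts)
```

Each row of `tdm` is one term and each column one abstract, which is the orientation a term-by-document matrix implies; a topic-model library would take this matrix (or its transpose) as input.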
To decide the optimal number of topics, the perplexity measure was used. Perplexity is a decreasing function of the log-likelihood of the topic model given the estimated parameters [82], and it has traditionally been used to choose the number of topics when cross-validation is applied to a topic model [82]. The number of topics was chosen where the corresponding perplexities from 10-fold cross-validation on the testing dataset were mostly low, as shown in Figure 2. A lower perplexity score indicates better generalization performance.
Figure 2 shows the change in perplexity with the number of topics (K). Figure 2 suggests that approximately 25 distinct topics can be interpreted, so 26 was taken as the maximum feasible number of topics.
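For reference, held-out perplexity is commonly computed as the exponential of the negative log-likelihood per token, which is why it falls as the log-likelihood rises. The sketch below, written in Python rather than the R used in the study, averages per-fold perplexities and returns the K with the lowest mean. Picking the strict minimum is a simplification of the "mostly low" criterion described above, and the log-likelihoods and token counts are made-up numbers, not results from this paper.

```python
import math


def perplexity(log_likelihood, n_tokens):
    """Perplexity of held-out text: exp(-log-likelihood per token)."""
    return math.exp(-log_likelihood / n_tokens)


def best_k(cv_results):
    """cv_results maps K -> list of (held-out log-likelihood, token count) per fold.

    Returns the K with the lowest mean perplexity and the mean perplexity per K.
    """
    mean_perplexity = {
        k: sum(perplexity(ll, n) for ll, n in folds) / len(folds)
        for k, folds in cv_results.items()
    }
    return min(mean_perplexity, key=mean_perplexity.get), mean_perplexity


# Illustrative numbers only (two folds per K instead of ten).
cv_results = {
    10: [(-7000.0, 1000), (-7100.0, 1000)],
    20: [(-6800.0, 1000), (-6900.0, 1000)],
    30: [(-6850.0, 1000), (-6950.0, 1000)],
}
k_star, scores = best_k(cv_results)
```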
The resulting topics are listed in Table 2. Each topic can be represented by its most likely occurring terms, and topic labels can be assigned using these terms. Table 2 displays each topic by its 20 words with the highest probability of occurrence.
The discovered topics relate to the investor system (Topics 2, 3, 4, 5, 12, 15, 16, and 24), traditional services and corporate finance (Topics 1, 8, 14, 17, 23, and 25), industry-related areas (Topics 10, 18, 19, and 22), individual services (Topics 6, 7, and 11), and new financial technology (Topics 9, 13, 20, 21, and 26). In particular, the topics on new financial technology appear to reflect new electronic transaction and payment techniques and services.
We also examined the relations among topics by building a graph of topics linked by co-occurring terms. Two topics were considered related if they shared co-occurring terms. Relations with fewer than three co-occurring terms were removed to simplify the graph. The topic network expands our understanding of the discovered topics, and variables derived from the topic network are used in our survival analysis. To give an overview of the topic relationships, the graph was also clustered using the walktrap algorithm [83, 84].
Figure 3 shows the resulting graph. Five subgraphs were found based on the co-occurrence of the terms described in Table 2.
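The edge rule described above (keep a link only when two topics share at least three terms) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and the sample term lists are hypothetical, and the walktrap clustering step is not implemented here, since it would normally be delegated to a graph library.

```python
from itertools import combinations


def topic_edges(top_terms, min_shared=3):
    """Link two topics when they share at least `min_shared` top terms."""
    edges = {}
    for a, b in combinations(sorted(top_terms), 2):
        shared = set(top_terms[a]) & set(top_terms[b])
        if len(shared) >= min_shared:
            edges[(a, b)] = sorted(shared)
    return edges


def degree(edges):
    """Number of neighbours per topic in the thresholded graph."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


# Hypothetical top terms for three topics (the study used 20 terms per topic).
top_terms = {
    1: ["loan", "bank", "credit", "rate", "mortgage"],
    2: ["loan", "bank", "credit", "trade", "order"],
    3: ["trade", "order", "auction", "bid", "price"],
}
edges = topic_edges(top_terms)
```

In this toy input, Topics 2 and 3 share only two terms, so that relation is dropped, while Topics 1 and 2 share three and remain linked.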
The identified subgraphs are as follows. The first subgraph (in squares) concerns banking and financing and covers seven topics: Topics 5, 6, 9, 10, 13, 19, and 25. These topics relate to taxes, wireless/mobile terminals for money transfers, machines and equipment in banks, processing loans, and financing vehicles. The second subgraph, located on the left-hand side, relates to financial securities, such as stocks and insurance, and consists of Topics 1, 8, 22, and 23. These topics represent advisory services, brokerages, health insurance, and a system for stock documents. The third subgraph (in circles), in the center, covers the investment system and new financial technology. It consists of 13 topics: Topics 2, 3, 5, 7, 11, 12, 15, 16, 17, 18, 20, 21, and 26. These topics relate to the trade system, the investor support platform, software for hedging, client interaction and reporting web systems, financial analysis for individuals, automated trade strategies and algorithms, engines for derivatives, trading annuities, simulation components for cash flows, auctions, electronic wallet services, techniques for optimizing authentication in the electronic environment, and the direct pay system. The fourth subgraph consists of Topic 24 and relates to equity analysis methodology, while the fifth consists of Topic 4 and relates to allocating the structure of financial income. Notably, the first and third subgraphs overlap in an area that covers Topics 2, 6, and 16. These overlapping topics relate to the trading system and can connect banking and financing with the investment system and new financial technology.
4.3. Survival Analysis of Topic Emergence
In this section, we analyze the emerging trend of each topic over time. The trend of each topic was examined using the average probability of the topic by year. Each topic showed a positive or negative slope in each year, indicating an increasing or decreasing trend for that topic in financial BMs.
The overall trends of the topics over time were then observed. In Figure 4, topic probability is on the y-axis, and its variability decreases from past to present. To reflect recent variability in the analysis, the probabilities of each topic were weighted by year, on the assumption that recent variations are more important than past variations. Following past studies on technology management, a weight of 0.6 was used to compute the EWMA [85, 86]. The EWMA produces smoothed topic probabilities over time, and Figure 5 shows the modified probabilities.
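The smoothing step can be illustrated with the standard EWMA recursion. The sketch below is in Python and rests on two assumptions that the text does not confirm: the value 0.6 is treated as the weight on the most recent observation, and the series is initialized with its first value. The yearly probabilities are hypothetical.

```python
def ewma(values, weight=0.6):
    """Exponentially weighted moving average.

    s[0] = x[0];  s[t] = weight * x[t] + (1 - weight) * s[t - 1]
    Assumption: `weight` is applied to the most recent observation.
    """
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(x)  # initialize with the first observation
        else:
            smoothed.append(weight * x + (1 - weight) * smoothed[-1])
    return smoothed


# Hypothetical yearly mean probabilities of one topic.
yearly_probability = [0.10, 0.20, 0.05, 0.05]
smoothed = ewma(yearly_probability)
```

With a weight of 0.6, the spike in the second year is damped and its influence decays geometrically in later years, so recent observations dominate the smoothed value.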
Next, this information was used to define the hot and cold statuses. The topics were categorized as hot or cold by applying a threshold to the EWMA-weighted probabilities: a topic was labeled hot in a given year if its EWMA probability exceeded the threshold, and cold otherwise. As explained above, this paper chose the threshold that identifies the top 10 hot topics per annum. Empirically, the 60th percentile of topic probabilities per annum identified the top 10 topics in most periods.
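The percentile rule can be sketched as follows. This Python example assumes a linearly interpolated percentile and a strict "greater than" comparison, neither of which is specified in the text, and the 26 probabilities are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def label_hot(probabilities_by_topic, q=60):
    """Label topics whose smoothed probability exceeds the year's q-th percentile."""
    threshold = percentile(list(probabilities_by_topic.values()), q)
    return {topic: p > threshold for topic, p in probabilities_by_topic.items()}


# Hypothetical smoothed probabilities of 26 topics in a single year.
year = {topic: topic / 1000 for topic in range(1, 27)}
hot = label_hot(year)
n_hot = sum(hot.values())
```

With 26 topics and distinct probabilities, the 60th percentile leaves 10 topics above the threshold, which matches the top-10 criterion described above; ties between topics can change that count in a given year.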
As shown in Figure 4 and Figure 5, the probabilities of various topics continue to rise and fall. Such rises and falls give four possible recurrent events: becoming hot (a topic moving from a cold to a hot status), remaining hot over a period, becoming cold (a topic moving from a hot to a cold status), and remaining cold over a period of years. Among these, this paper focused on and analyzed two events, becoming hot and becoming cold, because they cover the other cases. For example, the gap time for becoming hot almost entirely overlaps that for remaining cold, and the gap time for becoming cold mostly overlaps that for remaining hot. Such overlaps would lead to duplicated models for different cases. In addition, because of this overlap, the case of becoming cold can be interpreted as the case of remaining hot.
After the recurrent patterns were identified, the gap time of each topic was calculated for these patterns. For example, assume that the probability of a certain topic was above the threshold in 1995, 2001, 2005, 2008, 2009, 2010, and 2011. The patterns of becoming hot and becoming cold can then be identified and their gap times calculated. The patterns for becoming hot run from 1983 to 1995 (a gap time of 12 years), 1996 to 2001 (five years), 2002 to 2005 (three years), and 2006 to 2008 (two years). The patterns for becoming cold run from 1995 to 1996 (a gap time of one year), 2001 to 2002 (one year), and 2005 to 2006 (one year). The gap time was modeled with both the patent-related and topic-related explanatory variables.
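The worked example above can be reproduced with a short routine. This is a minimal Python sketch, assuming an observation window of 1983 to 2011 (1983 is taken from the first interval in the example, and 2011 is the last hot year listed). The function name is hypothetical, and spells still open at the end of the window are simply not recorded, whereas a survival model would treat them as censored.

```python
def recurrence_gaps(hot_years, first_year, last_year):
    """Return (becoming_hot, becoming_cold) as lists of (start, end, gap) tuples.

    A spell starts at `first_year` or at the year of the previous status change
    and ends in the year the status flips.
    """
    hot = set(hot_years)
    becoming_hot, becoming_cold = [], []
    spell_start = first_year
    previous_hot = first_year in hot
    for year in range(first_year + 1, last_year + 1):
        is_hot = year in hot
        if is_hot != previous_hot:
            record = (spell_start, year, year - spell_start)
            (becoming_hot if is_hot else becoming_cold).append(record)
            spell_start = year
            previous_hot = is_hot
    return becoming_hot, becoming_cold


# Years in which the example topic is above the threshold.
hot_years = [1995, 2001, 2005, 2008, 2009, 2010, 2011]
becoming_hot, becoming_cold = recurrence_gaps(hot_years, 1983, 2011)
```

Each tuple holds the start year, the event year, and the gap time in years, and the gaps are the quantities that enter the model as the time to each recurrent event.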
With the above setting, we constructed PWP-GT models of the two patterns for each threshold, which yielded one set of results for becoming hot and one for becoming cold. The estimated PWP-GT coefficients can be interpreted as follows: if a coefficient is positive, the event tends to recur more quickly. This interpretation must be applied carefully in this study. For becoming hot, a shorter gap time is preferred, and the estimated coefficients are interpreted as usual. For becoming cold, however, a longer gap time is preferred, and the estimated coefficients need to be interpreted inversely.
Table 5 and Table 6 show the PWP-GT results for the two types of patterns; variables with p-values below 5% are considered significant.
The PWP-GT model for becoming hot showed a goodness-of-fit value of 0.509.
Table 6 shows that, for the results with a threshold of 0.6, the following variables were significantly associated with the recurrence of becoming hot: the average age of patents, the average period of patents, the average number of patent IPCs, the transitivity on the topic network per topic, and the degree centrality on the topic network per topic.
The average age of patents was positively associated with the frequent recurrence of becoming a hot topic. If the average age increases by one unit, the hazard of becoming a hot topic increases 72.0461-fold; if it decreases by one unit, the hazard decreases by 98.61%. On the other hand, the average period of patents was negatively associated with the frequent recurrence of becoming a hot topic. A one-unit increase in the average period multiplies the hazard of becoming a hot topic by 0.573; a one-unit decrease increases the hazard by 74.52%. In addition, the average number of patent IPCs was negatively associated with the frequent recurrence of becoming a hot topic. If the average number of IPCs increases by one unit, the hazard of becoming a hot topic decreases by 89.15%; if it decreases by one unit, the hazard increases 9.2153-fold.
With regard to the topic-related variables, the transitivity of the topic network was positively associated with the frequent recurrence of becoming hot. If the transitivity increases by one unit, the hazard of becoming a hot topic increases 2.711-fold; a one-unit decrease multiplies the hazard by 0.3689. On the other hand, the degree centrality on the topic network was negatively associated with the frequent recurrence of becoming hot. If the degree centrality increases by one unit, the hazard of becoming a hot topic decreases by 84.42%; a one-unit decrease in degree centrality increases the hazard 6.42-fold.
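The paired figures reported above follow from two identities of proportional hazards models: the hazard ratio for a one-unit increase is the exponential of the coefficient, and the ratio for a one-unit decrease is its reciprocal. The Python sketch below checks two of the quoted pairs; the function names are illustrative only.

```python
import math


def hazard_ratio(coefficient):
    """Hazard ratio for a one-unit increase in a covariate: exp(coefficient)."""
    return math.exp(coefficient)


def percent_change(ratio):
    """Percentage change in the hazard implied by a hazard ratio."""
    return (ratio - 1) * 100


def one_unit_decrease(ratio):
    """Hazard ratio for a one-unit decrease: the reciprocal of the ratio."""
    return 1 / ratio


# Hazard ratios quoted in the text for the becoming-hot model.
age_hr = 72.0461
period_hr = 0.573

age_decrease = percent_change(one_unit_decrease(age_hr))        # a decrease of about 98.61%
period_decrease = percent_change(one_unit_decrease(period_hr))  # an increase of about 74.52%
```

A ratio above 1 therefore corresponds to a positive coefficient and a shorter expected gap time, and a ratio below 1 to a negative coefficient and a longer one.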
The model for becoming cold showed a goodness-of-fit value of 0.413, and Table 6 indicates that the following variables are significant for the recurrence of becoming a cold topic: the average forward citations of patents, the average number of patent IPCs, the average age of patents, the trend of the topic slope, and the transitivity on the topic network per topic. The dependent variable of this model is the frequent recurrence of becoming a cold topic, so the coefficients are interpreted inversely.
The average forward citations of patents were negatively associated with the frequent recurrence of becoming a cold topic. If the average number of forward citations increases by one unit, the hazard of becoming a cold topic decreases by 81.38%; if it decreases by one unit, the hazard increases 5.3716-fold. In terms of maintaining a hot topic, the average forward citations were therefore positively associated. In addition, the average number of patent IPCs was negatively associated with the frequent recurrence of becoming a cold topic. If the average number of IPCs increases by one unit, the hazard of becoming a cold topic decreases by 88.8%; if it decreases by one unit, the hazard increases 8.9324-fold. Interpreted inversely, this variable was positively associated with a topic maintaining a hot status.
On the other hand, the following variables showed a positive association with becoming a cold topic and can thus be interpreted as negatively associated with maintaining a hot topic status. First, the average age of patents was positively associated with the frequent recurrence of becoming a cold topic. A one-unit increase in the average age increases the hazard of becoming a cold topic 16.6845-fold; a one-unit decrease reduces the hazard by 94.1%. Next, with regard to the topic-related variables, the transitivity of the topic network was positively associated with the frequent recurrence of becoming cold. If the transitivity increases by one unit, the hazard of becoming a cold topic increases 1.9745-fold; a one-unit decrease multiplies the hazard by 0.5065. Finally, the trend of the topic slope was positively associated with the frequent recurrence of becoming cold. If the slope increases by one unit, the hazard of becoming a cold topic increases 2.1810-fold; a one-unit decrease in the slope multiplies the hazard by 0.4585.
Interestingly, the average age of patents, the average number of patent IPCs, and the transitivity of the topic network have the same coefficient sign in both models. The average age and the transitivity were positively associated with both becoming hot and becoming cold, that is, negatively associated with maintaining a hot status, whereas the average number of IPCs was negatively associated with both. These variables were considered to lead changes of topic status. On the other hand, the average period of patents and the degree centrality on the topic network affected only becoming a hot topic. Similarly, the average forward citations of patents and the trend of the topic slope influenced only becoming a cold topic.