A Knowledge Discovery Method for Landslide Monitoring Based on K-core Decomposition and the Louvain Algorithm

Abstract: Landslide monitoring plays an important role in predicting, forecasting and preventing landslides. Quantitative explorations at the subject level and fine-scale knowledge in landslide monitoring research can be used to provide information and references for landslide monitoring status analysis and disaster management. Given the large amount of information in keyword co-occurrence networks, it is difficult to clearly determine and display the domain topic hierarchy and knowledge structure. This paper proposes a landslide monitoring knowledge discovery method that combines the K-core decomposition and Louvain algorithms. In this method, author keywords from the literature are used as nodes to construct a weighted co-occurrence network, and a pruning standard value is defined for K. The K-core approach is used to decompose the network into subgraphs. Combined with the unsupervised Louvain algorithm, subgraphs are divided into different topic communities by setting a modularity change threshold, which is used to establish a topic hierarchy and identify fine-scale knowledge related to landslide monitoring. Based on Web of Science data, a comparative experiment involving the above method and a high-frequency keyword subgraph method for landslide monitoring knowledge discovery is performed. In the resulting 5-core network subgraph of landslide monitoring keyword co-occurrence, 17 community structures can be identified, and the degree value and density of subcommunities are analysed by taking the community with the largest proportion of nodes as an example. The results show that the run time of the proposed method is significantly lower than that of the traditional method.


Landslide monitoring provides strong technical support for understanding landslide evolution processes and is an important approach for disaster prevention and reduction (Whiteley et

and malaria research (Fu et al. 2015). The Louvain algorithm is a computationally expensive and time-consuming algorithm (Blondel et al. 2008; Orman et al. 2011; Meo et al. 2011) that is suitable for the division of small and medium-sized networks. Rich text semantic relations can produce dense topics for knowledge discovery (Daud et al. 2012). For some networks with small numbers of nodes, the topic hierarchy can be effectively determined with the Louvain algorithm, but for networks with abundant information or unclear expressions, pruning is needed to determine and display the topic hierarchy.

Previous studies (Xiao et al. 2016; Kadi et al. 2017; Zhao et al. 2014) generally set thresholds to screen keywords according to the word frequency or edge weights, but these methods did not consider the possible effect of semantic association between two keywords. Seidman (1983) proposed the K-core approach to express the specific hierarchical structure properties and hierarchical characteristics of networks, and this method has been widely applied to hierarchical decomposition networks (Zhang et al.

This paper presents a combined quantitative and qualitative method to explore the subject hierarchy and fine-scale knowledge in the research field of landslide monitoring and to analyse the degree, density and community division results for the resulting subnetworks. The remainder of this paper is organized as follows. In the first section, the methods, including the overall research concept, are introduced, and

The technical route of knowledge discovery in the field of landslide monitoring is shown in Fig. 1.

The Web of Science data are preprocessed through data filtering to reduce invalid data and noise in the original records. According to the word frequency and co-occurrence relationships among the extracted keywords, the co-occurrence matrix is obtained, and a co-occurrence network of weighted keywords related to landslide monitoring is constructed. The pruning index is defined, and a co-occurrence network subgraph is generated based on the structure of the peripheral nodes; the core nodes are retained, and

where represents the K value of each shell, is the number of shells, M is the total number of nodes, and i is the shell for each k value. When the k-value of a node is less than K, some of the nodes can be deleted; otherwise, all nodes should be retained. As shown in Fig. 2, the network consists of three shells that contain 12 nodes. Eq. 1 shows that some nodes in shell 1 need to be removed. Defining the K-value thus sets the pruning standard for generating the subgraph. In the next section, the process of generating K-core subgraphs for landslide monitoring is introduced.

The process of decomposing the keyword co-occurrence network according to the K-value is shown in Fig. 3. The K-core subgraph is the union of all shells with k-values greater than or equal to K.
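The shell decomposition described above can be illustrated with a minimal sketch. It assumes an unweighted, undirected network stored as a dictionary of neighbour sets and uses the standard peeling procedure; the function names `core_numbers` and `k_core_subgraph` and the toy network are illustrative assumptions, not the implementation used in this study.

```python
def core_numbers(adj):
    """Return the shell index (core number) of every node.

    adj: dict mapping each node to the set of its neighbours
         (undirected, unweighted; an assumption of this sketch).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        v = min(remaining, key=lambda u: degree[u])
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def k_core_subgraph(adj, K):
    """Keep the union of all shells whose k-value is at least K."""
    core = core_numbers(adj)
    keep = {v for v, k in core.items() if k >= K}
    return {v: adj[v] & keep for v in keep}


# Toy network: a triangle (a, b, c) with a two-node tail (d, e).
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
shells = core_numbers(adj)
two_core = k_core_subgraph(adj, 2)
```

In this toy network the tail nodes fall into shell 1 and are pruned when K = 2, while the triangle is retained as the 2-core.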

According to the K value of each node, the relationship between the node and the co-occurrence matrix of landslide monitoring is assessed, and some nodes can be removed. In this study, we briefly discuss the influence of the proposed method and the high-frequency nodes on the community structure detection algorithm applied to the landslide monitoring co-occurrence network. For networks with the same amount of node information and fewer edge connections than k-subgraphs, the proposed method can significantly reduce the run time while ensuring high quality.

In the Louvain algorithm, the most time-consuming step is to assign single nodes to communities (i.e., the first stage). Therefore, the K-core algorithm is needed to prune and retain the main community structure.
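To make the cost of the first stage concrete, the following is a minimal sketch of the node-moving step, assuming the standard modularity gain of Blondel et al. (2008) for placing an isolated node i in community C, $\Delta Q = k_{i,in}/m - \Sigma_{tot} k_i / (2m^2)$, where m is the total edge weight. The aggregation stage is omitted, and the function name `local_moving` and the parameter `min_gain` (standing in for a modularity change threshold) are hypothetical.

```python
def local_moving(adj, min_gain=1e-7):
    """First Louvain stage: repeatedly move single nodes between communities.

    adj: {node: {neighbour: weight}}, symmetric, without self-loops
         (assumptions of this sketch).
    min_gain: hypothetical modularity-change threshold; a node moves only
              if the gain over staying put exceeds this value.
    Returns {node: community label}.
    """
    m = sum(w for nbrs in adj.values() for w in nbrs.values()) / 2.0
    strength = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    community = {v: v for v in adj}   # every node starts in its own community
    tot = dict(strength)              # total strength of each community
    improved = True
    while improved:
        improved = False
        for v in adj:
            old = community[v]
            # Weight of the links from v to each neighbouring community.
            links = {}
            for u, w in adj[v].items():
                links[community[u]] = links.get(community[u], 0.0) + w
            tot[old] -= strength[v]   # temporarily remove v from its community
            stay = links.get(old, 0.0) / m - tot[old] * strength[v] / (2.0 * m * m)
            best, best_gain = old, stay
            for c, w_c in links.items():
                g = w_c / m - tot[c] * strength[v] / (2.0 * m * m)
                if g > best_gain:
                    best, best_gain = c, g
            if best_gain - stay <= min_gain:
                best = old            # improvement too small: keep v where it was
            tot[best] += strength[v]
            if best != old:
                community[v] = best
                improved = True
    return community


# Toy network: two triangles joined by the single edge c-d.
pairs = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
adj = {}
for u, v in pairs:
    adj.setdefault(u, {})[v] = 1.0
    adj.setdefault(v, {})[u] = 1.0
parts = local_moving(adj)
```

Every pass visits each node and each of its edges, so removing low-shell nodes before this stage directly reduces the work per pass.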

After pruning, the process of knowledge discovery based on the corresponding landslide monitoring co-occurrence network is as follows.

The first stage involves calculating the modularity Q according to the input node and edge set. The

where , is the sum of the edge weights of nodes in the community, m is the number of edges,

where n is the total number of edges in the network, $A_{i,j}$ represents the weight of the edge between keyword nodes $i$ and $j$, and $k_i$ and $k_j$ denote the total weights of all the edges associated with the two keywords. $\delta(c_i, c_j)$ is a Boolean function that equals 1 when the two keyword nodes belong to the same community and 0 otherwise.
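For reference, the definitions above match the standard modularity of Blondel et al. (2008), $Q = \frac{1}{2n} \sum_{i,j} \left[ A_{i,j} - \frac{k_i k_j}{2n} \right] \delta(c_i, c_j)$; that this is the exact form intended here is an assumption. The sketch below evaluates it community by community, taking n as the total edge weight (equal to the number of edges when all weights are 1). The function name `modularity` and the toy data are illustrative.

```python
def modularity(edges, community):
    """Modularity Q of a partition of a weighted, undirected network.

    edges: {(u, v): weight}, each edge listed once, no self-loops
           (assumptions of this sketch).
    community: {node: community label}.
    """
    total = sum(edges.values())   # n in the text, taken as total edge weight
    strength = {}                 # k_i: total weight of edges at each node
    internal = {}                 # edge weight inside each community
    for (u, v), w in edges.items():
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0.0) + w
    tot = {}                      # summed k_i over each community
    for node, s in strength.items():
        c = community[node]
        tot[c] = tot.get(c, 0.0) + s
    # Grouping the double sum by community gives
    # Q = sum_c [ internal_c / n - (tot_c / (2n))^2 ].
    return sum(internal.get(c, 0.0) / total - (tot[c] / (2.0 * total)) ** 2
               for c in tot)


# Toy network: two triangles joined by the single edge c-d.
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1, ("c", "d"): 1}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

For this toy network, separating the two triangles gives Q = 5/14, whereas placing all nodes in one community gives Q = 0.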

Generally, the larger the modularity value is, the better the division result. The range of modularity is [-

Then, 12193 keywords were obtained by extracting author keywords, which were used to construct a keyword co-occurrence network. As shown in Table 1, since the total number of possible co-occurrence relationships between 12193 keywords is 148669249 (12193 × 12193), such a large data set is difficult to construct, and many single-frequency keywords are not associated with other keywords in the co-occurrence relationship set.
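A minimal sketch of how such a weighted co-occurrence network can be built from author keywords is given below. The input format (one keyword list per paper), the function name `build_cooccurrence`, the `min_freq` parameter and the toy records are assumptions made for illustration and do not reproduce the data used in this study.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(records, min_freq=2):
    """Build a weighted keyword co-occurrence network.

    records: list of keyword lists, one per paper (hypothetical input format).
    min_freq: keywords occurring in fewer papers than this are dropped.
    Returns (freq, edges), where edges maps a sorted keyword pair to the
    number of papers in which both keywords appear.
    """
    # Count each keyword at most once per paper.
    freq = Counter(kw for rec in records for kw in set(rec))
    kept = {kw for kw, n in freq.items() if n >= min_freq}

    edges = Counter()
    for rec in records:
        present = sorted(set(rec) & kept)
        # Every unordered pair of retained keywords in a paper co-occurs once.
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return freq, dict(edges)


records = [
    ["landslide", "monitoring", "insar"],
    ["landslide", "insar", "deformation"],
    ["landslide", "monitoring", "gps"],
]
freq, edges = build_cooccurrence(records, min_freq=2)
```

Storing only the pairs that actually co-occur avoids materialising the full keyword-by-keyword matrix, and the frequency threshold removes keywords that appear in a single paper.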

Therefore, this paper selects 2589 keywords with frequencies greater than or equal to 2 to construct a keyword co-occurrence network for analysis, and a total of 19305 co-occurrence semantic relationships are obtained.

Based on the effective literature data set, the co-occurrence frequencies for keywords can be calculated, and the co-occurrence matrix can be created. After K-core analysis, the keyword network was divided into 25 levels, as shown in Fig. 4. The number of nodes connected to each node is called the node degree, and the average value of all node degrees is called the network average degree, which is used to represent the complexity of the network (Freeman 1979). As shown in Fig. 4

Table 2 Keywords associated with the landslide monitoring communities (K ≥ 5)

The abovementioned community structure detection method is evaluated against a high-frequency keyword subnetwork with the same number of nodes as the 5-core subgraph. After Louvain community division, 18 community structures were obtained, with a modularity of 0.3855. Additionally, the community with the largest proportion of nodes was selected as the representative community (Fig. 8).

The results of community detection based on high-frequency keyword pruning and the K-core method were evaluated based on the relative run time and modularity Q value. The relative run time refers to the ratio of the community detection time after pruning to that before pruning. The results shown in Fig. 9 indicate that the overall run time of the K-core pruning method is significantly lower than that of the high-frequency keyword feature selection method; the modularity of the K-core pruning method fluctuates but is slightly higher than that of the high-frequency keyword feature selection method. When the core value is 5, the modularity of the K-core pruning network community structure is higher than that of the high-frequency keyword network structure.
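The relative run time defined above can be measured with a small helper such as the following sketch. The names `timed` and `relative_run_time` are hypothetical, and any community detection routine can be passed in as `fn`.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def relative_run_time(t_after, t_before):
    """Ratio of the detection time after pruning to that before pruning."""
    return t_after / t_before
```

A value below 1 indicates that detection on the pruned network is faster than on the unpruned network.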

(1) To explore the topic hierarchy and fine-scale knowledge in the landslide monitoring field, the degree value characteristics, subgraph density and community structure of nodes in the keyword co-

(2) K-core decomposition is used to generate subgraphs, and the optimal subset is selected by