Method for Mid-Long-Term Prediction of Landslides Movements Based on Optimized Apriori Algorithm

Abstract: In the study of the mid-long-term early warning of landslide, the computational efficiency of the prediction model is critical to the timeliness of landslide prevention and control. Accordingly, enhancing the computational efficiency of the prediction model is of practical implication to the mid-long-term prevention and control of landslides. When the Apriori algorithm is adopted to analyze landslide data based on the MapReduce framework, numerous frequent item-sets will be generated, adversely affecting the computational efficiency. To enhance the computational efficiency of the prediction model, the IAprioriMR algorithm is proposed in this paper to enhance the efficiency of the Apriori algorithm based on the MapReduce framework by simplifying operations of the frequent item-sets. The computational efficiencies of the IAprioriMR algorithm and the original AprioriMR algorithm were compared and analyzed in the case of different data quantities and nodes, and then the efficiency of IAprioriMR algorithm was verified to be enhanced to some extent in processing large-scale data. To verify the feasibility of the proposed algorithm, the algorithm was employed in the mid-long-term early warning study of landslides in the Three Parallel Rivers. Under the same conditions, IAprioriMR algorithm of the same rule exhibited higher confidence than FP-Growth algorithm, which implied that IAprioriMR can achieve more accurate landslide prediction. This method is capable of technically supporting the prevention and control of landslides.


Introduction
Geological disasters pose a serious threat to human life and property, and landslides are among the most common of them. Besides threatening life and property, landslides severely damage the environment and ecology [1]. One feasible way to comprehensively control landslide hazards is to predict them [2].
In accordance with system theory and nonlinear science, many novel techniques have been employed to study the stability of landslides, a range of comprehensive prediction models have been summarized, and landslide prediction criteria have been proposed, including decision trees [3,4], the generalized FLaIR model (GFM) [5], artificial neural networks [6], object-oriented methods [7] and support vector machines (SVM) [8]. Landslides of different causes were systematically summarized, and landslides of different types were comprehensively analyzed to build prediction models for each [9]. However, these prediction models are not applicable to the mid-long-term prediction of landslide hazards [10]. Furthermore, landslide occurrence is difficult to predict accurately, and the mid-long-term early warning of landslides is hard to achieve [11].
Recently, big data technology has been increasingly employed for landslide disaster early warning [12,13]. Because it allows historical landslide data to be analyzed, big data technology has become a hotspot for prompt and feasible mid-long-term landslide early warning. However, existing distributed landslide prediction models do not enhance the efficiency of the underlying algorithm. In the study of mid-long-term landslide warning, the operational efficiency of the prediction model is vital to the timeliness of landslide prevention and control [14].
Association rule mining can be used to determine the levels of association among various causes [15]. It has been successfully applied to uncover such cause-and-effect relationships in a variety of fields, including the causes of occupational accidents [16] and the prediction of landslide hazards [17]. When the Apriori algorithm runs on the MapReduce framework, it generates numerous frequent item-sets, which significantly reduces its efficiency [18]. Previous studies have run the Apriori algorithm on the MapReduce framework [13,18], but they did not optimize the algorithm for landslide data correlation analysis. In the processing of large-scale landslide data, the algorithm has low efficiency because it must generate many frequent item-sets, which adversely affects the timely prevention and control of landslides [18,19].
To enhance the computational efficiency of the prediction model, the Apriori algorithm was optimized based on the MapReduce framework, and the resulting IAprioriMR algorithm is proposed in this paper. The Three Parallel Rivers area, where geological disasters are frequent, was taken as the study area, and the relevant landslide data in the study area were collected. The feasibility of the algorithm was verified by applying IAprioriMR to the mid-long-term early warning of landslides.

IAprioriMR
Apriori is one of the commonest association rule algorithms, and several methods have been derived from it [20]. The Apriori algorithm is a loop method that uses hierarchical sequential search, i.e., an iterative layer-by-layer search, to generate frequent item-sets. However, a single host cannot handle a large number of operations, and processing large volumes of data is becoming a new trend. Recently, many algorithms have been proposed that apply Apriori in the MapReduce computing framework [21]. This MapReduce-based pattern is hereinafter referred to as AprioriMR. AprioriMR produces all candidate item-sets at once in the Map task, and each candidate item-set is emitted as a <k, v> pair. Subsequently, the candidate item-sets are grouped by key and counted in the Reduce task, and the minimum support is used to screen out the high-frequency candidate item-sets [22].
Adapted in this way, the conventional Apriori algorithm fits the MapReduce framework for distributed parallel computing. However, because it produces all candidate item-sets at once in the Map task, it has a large memory footprint, which significantly decreases the operating efficiency [23]. Thus, the counting in the Map stage is time-consuming, and valuable association rules are difficult to find. The Apriori algorithm therefore imposes a heavy burden on MapReduce [24].
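The Map and Reduce steps described above can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the function names `apriorimr_map` and `apriorimr_reduce`, the item labels, and the toy transactions are hypothetical, and the minimum support is given as an absolute count.

```python
from collections import defaultdict
from itertools import combinations


def apriorimr_map(transaction):
    """Emit a (pattern, 1) pair for every non-empty subset of one transaction."""
    items = sorted(set(transaction))
    for size in range(1, len(items) + 1):
        for pattern in combinations(items, size):
            yield pattern, 1


def apriorimr_reduce(pairs, min_count):
    """Group pairs by key, sum the counts, keep patterns reaching min_count."""
    totals = defaultdict(int)
    for pattern, count in pairs:
        totals[pattern] += count
    return {p: c for p, c in totals.items() if c >= min_count}


# Hypothetical toy transactions (not taken from the landslide dataset).
transactions = [
    ["rain_high", "gw_rise", "slide"],
    ["rain_high", "slide"],
    ["rain_high", "gw_rise"],
]

# All candidates of all sizes are produced in one pass, which is the
# source of the large intermediate volume discussed in the text.
pairs = [kv for t in transactions for kv in apriorimr_map(t)]
frequent = apriorimr_reduce(pairs, min_count=2)

print(len(pairs))
print(frequent)
```

A transaction with n distinct items yields 2^n - 1 pairs in this scheme, so the number of intermediate pairs grows exponentially with transaction length.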
In general, an association rule is expressed as $X \Rightarrow Y$, where $X$ and $Y$ are item-sets, defined as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items; a pattern $P$ is defined as $P = \{i_j, \ldots, i_k\} \subseteq I$, with $j \geq 1$ and $k \leq n$. Given a pattern $P$, its length or size is expressed as $|P|$, i.e., the number of singletons it covers. Thus, for $P = \{i_1, i_2, \ldots, i_k\} \subseteq I$, its size is $|P| = k$. Besides, given the set of all transactions $T = \{t_1, t_2, \ldots, t_m\}$ in a dataset, the support of a pattern $P$ is defined as the number of transactions that $P$ satisfies, i.e., $\mathrm{support}(P) = |\{t_l \in T : P \subseteq t_l\}|$. A pattern $P$ is considered frequent if $\mathrm{support}(P) \geq \mathit{threshold}$ [22]; in this paper, the minimum support threshold is 1%. Thus, for each pair $\langle P, \{\mathrm{supp}(P)_1, \mathrm{supp}(P)_2, \ldots, \mathrm{supp}(P)_m\} \rangle$, the result is a $\langle P, \mathrm{supp}(P) \rangle$ pair, where $\mathrm{supp}(P) = \sum_{l=1}^{m} \mathrm{supp}(P)_l$. When dealing with big data, this method yields a huge candidate set because of the large number of transactions.
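As a small worked illustration of these definitions, the sketch below counts the support of one pattern in two chunks of transactions and checks that the partial supports add up to the support over the whole dataset. The helper name `support`, the item labels, and the chunks are hypothetical, and expressing the 1% threshold as a fraction of all transactions is an assumption about how the relative threshold is applied.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    wanted = set(pattern)
    return sum(1 for t in transactions if wanted <= set(t))


# Hypothetical dataset split into two chunks, as two mappers would see it.
chunks = [
    [["a", "b", "c"], ["a", "c"]],
    [["b", "c"], ["a", "b", "c"], ["c"]],
]
pattern = ("a", "c")

# Partial supports supp(P)_l, one per chunk, and their sum supp(P).
partial = [support(pattern, chunk) for chunk in chunks]
total = sum(partial)

# The same value is obtained by counting over the undivided dataset.
all_transactions = [t for chunk in chunks for t in chunk]
relative = total / len(all_transactions)
is_frequent = relative >= 0.01  # assumed reading of the 1% minimum support

print(partial, total, relative, is_frequent)
```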
To enhance the computational efficiency, this study proposes a novel algorithm (hereinafter referred to as IAprioriMR). For each transaction $t_l \in T$, it does not yield the whole set of candidate item-sets $C$, but only the subset $c \subseteq C$ consisting of patterns of size $|P| = s$. Hence, a set of iterations is required, one per item-set size (see Algorithm 1). The dataset is split into different chunks of data, one per mapper, and each mapper is responsible for analyzing each single transaction $t_l \in T$ to generate $\langle P, \mathrm{supp}(P)_l \rangle$ pairs. Lastly, the algorithm also uses multiple reducers to scale down the computational cost. The major difference between AprioriMR and IAprioriMR lies in the mapper, since IAprioriMR obtains only the patterns $P$ of size $|P| = s$ for each sub-database (see Figure 1). Thus, the number of $\langle P, \mathrm{supp}(P)_l \rangle$ pairs generated by each mapper is lower than that generated by the AprioriMR algorithm [22].
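The per-size iteration described above can be sketched as follows. This is a simplified single-process illustration, not Algorithm 1 itself: the names `iapriorimr_map`, `iapriorimr_reduce` and `iapriorimr` are hypothetical, each mapper aggregates its counts locally per chunk (a simplification of emitting one pair per transaction), and the early stop when a size yields no frequent pattern is an added assumption based on the fact that a pattern cannot be frequent unless all its subsets are.

```python
from collections import Counter
from itertools import combinations


def iapriorimr_map(chunk, size):
    """Mapper: count only the patterns of exactly `size` items in one chunk."""
    local = Counter()
    for transaction in chunk:
        items = sorted(set(transaction))
        local.update(combinations(items, size))
    return local


def iapriorimr_reduce(partials, min_count):
    """Reducer: sum the partial supports and keep the frequent patterns."""
    total = Counter()
    for local in partials:
        total.update(local)
    return {p: c for p, c in total.items() if c >= min_count}


def iapriorimr(chunks, min_count):
    """Run one map/reduce round per item-set size s = 1, 2, ..."""
    max_size = max(len(set(t)) for chunk in chunks for t in chunk)
    frequent = {}
    for size in range(1, max_size + 1):
        partials = [iapriorimr_map(chunk, size) for chunk in chunks]
        level = iapriorimr_reduce(partials, min_count)
        if not level:
            break  # assumed early stop: no larger pattern can be frequent
        frequent.update(level)
    return frequent


# Hypothetical chunks, one per mapper (not taken from the landslide dataset).
chunks = [
    [["rain_high", "gw_rise", "slide"], ["rain_high", "slide"]],
    [["rain_high", "gw_rise"]],
]
result = iapriorimr(chunks, min_count=2)
print(result)
```

In each round a mapper holds only the patterns of one size, which is why the number of pairs per mapper is lower than when all sizes are produced at once.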

Algorithm Performance Analysis
The frequent pattern (FP)-growth algorithm, designed to solve the problem of the high number of transactions and comparisons, is well known in association rule mining. FP-growth stores the frequent item-sets in a tree structure, requiring data to be scanned just once. Nevertheless, it still struggles with large numbers of candidate item-sets, owing to either the large number of I/O operations or the large amount of memory required to store all the sets. Based on the FP-Tree structure, many authors have extended the FP-Growth method [20].
The three algorithms (FP-growth, AprioriMR and IAprioriMR) were tested in the same hardware environment (DELL R730, dual 8 cores), with the Webdoc dataset as the experimental data; the pattern mining community regards Webdoc as the largest such dataset. AprioriMR and IAprioriMR were run on three virtual nodes on the physical machine. For different numbers of instances and the same association rules, the performances of the three algorithms differ significantly (see Figure 2). For the same number of instances, the number of nodes in MapReduce significantly affects the performance of the algorithm (see Figure 3).

According to the results of the comparison experiments, when the number of instances exceeds 32,000, the IAprioriMR algorithm performs significantly better than the conventional parallel AprioriMR and FP-growth algorithms. When the amount of data is above 1024 thousand, the efficiency of the IAprioriMR algorithm is more than 50% higher than that of the AprioriMR algorithm. As the number of instances rises, the efficiency gain of the IAprioriMR algorithm becomes more pronounced. Likewise, as the number of nodes increases, IAprioriMR significantly shortens the MapReduce processing time and enhances the efficiency in processing large-scale data.

Study Area
The Three Parallel Rivers cover the Nujiang River, the Lantsang River, the Jinsha River, and the mountains in the basin at 26°03′~29°16′ N and 98°7′~101°19′ E (see Figure 4). The north and northwest have a temperate and cold temperate monsoon climate, and the south has a plateau monsoon climate. From the Meili Snow Mountain at 6740 m above sea level to the Nu River at an altitude of nearly 700 m (see Figure 5), the terrain height varies significantly, and the geological structure is complex. Thus, this area is prone to geological disasters [25].
Many factors contribute to the occurrence of landslides. The landslide monitoring data generally refer to the quantitative values of the inducing factors (e.g., rainfall and water level) affecting the slope variation. During landslide formation and occurrence, there are also many uncertain factors (e.g., human engineering and sudden earthquakes) [26], as well as numerous other landslide monitoring data, including graphical data (GIS data), image data (remote sensing data), and relevant geophysical data. Spatial data (e.g., spatial location and spatial relationship) take up a large proportion of the landslide monitoring data, so the monitoring data are closely correlated.
The landslide monitoring data exhibit the following characteristics: their attributes show strong and continuous spatial correlation, so regression and fitting algorithms help build the pattern for analysis; and the effect of rainfall on landslide displacement is cumulative and delayed, so a landslide may be triggered shortly after heavy rainfall or by rainfall accumulated over time.
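One common way to represent the cumulative and delayed effect of rainfall is to derive accumulated and lagged rainfall features from the monthly series. The sketch below is a hedged illustration only: the helper names `antecedent_rainfall` and `lagged`, the three-month window, the one-month lag, and the rainfall values are all hypothetical and are not taken from the paper.

```python
def antecedent_rainfall(monthly_rain, window):
    """Rainfall summed over the current month and the previous window-1 months."""
    return [
        sum(monthly_rain[max(0, i - window + 1): i + 1])
        for i in range(len(monthly_rain))
    ]


def lagged(series, lag):
    """Shift a series by `lag` months; the first `lag` values are unknown."""
    return [None] * lag + series[: len(series) - lag]


# Hypothetical monthly rainfall in mm (not measured data).
rain = [10, 0, 50, 120, 30, 0]

accumulated = antecedent_rainfall(rain, window=3)  # cumulative effect
delayed = lagged(rain, lag=1)                      # delayed effect

print(accumulated)
print(delayed)
```

Features of this kind could then be discretized into items before association rule mining.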

Data
The landslide monitoring data from 2000 to 2013 in the Three Parallel Rivers were taken as the experimental data. The dataset totals 19.6 GB and covers 175,000 instances. There are three landslide displacement monitoring stations (the ShangQiaotou, XiangDuo and JuDian landslide monitoring stations), representing the upstream, midstream and downstream areas of the Three Parallel Rivers, respectively.
The Three Parallel Rivers have high rainfall and a clear climate. According to the rainfall data from 2000 to 2013, the rainfall peak in the Three Parallel Rivers area appeared in 2004, and the rainfall has gradually decreased in recent years (see Figure 6). The groundwater levels in the upstream, midstream and downstream areas of the Three Parallel Rivers area were recorded by the ShangQiaotou, XiangDuo and JuDian landslide displacement monitoring stations, respectively. Each monitoring station measures the data once a month. Figure 7 shows the annual average rate of change of the groundwater level in the area of the Three Parallel Rivers. Because of its special geographical location, geological disasters frequently occur in the Three Parallel Rivers area. Table 1 lists, for each landslide hazard, the type of landslide, the nature of the landslide, the maximum rainfall, and other attributes.

Experiment and Analysis
Studies have reported that the occurrence of landslide hazards is associated with the geological environment (e.g., rainfall, geological structure and groundwater distribution) [27]. In the experiment, the groundwater level at the monitoring point, the rainfall, the water level of the rivers in the Three Parallel Rivers, and the cumulative displacement at the landslide monitoring point were set as the antecedents (X) of landslide occurrence, and the occurrence of a landslide, which is detectable, was set as the consequent (Y). The landslide monitoring data from 2000 to 2011 in the Three Parallel Rivers area were taken as the training data, and the landslide monitoring data from 2012 to 2013 as the test data [28].
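The setup described above, with four monitored quantities as antecedents and a split by year, can be sketched as follows. The record layout, the field names, the threshold values, and the helpers `to_transaction` and `split_by_year` are hypothetical choices made for illustration; the paper does not specify how the monitoring values were discretized.

```python
def to_transaction(record, thresholds):
    """Turn one monitoring record into a set of items for rule mining."""
    items = {
        f"{name}_high"
        for name, limit in thresholds.items()
        if record[name] >= limit
    }
    if record["landslide"]:
        items.add("landslide")
    return items


def split_by_year(records, last_training_year=2011):
    """Training data up to and including 2011, test data afterwards."""
    train = [r for r in records if r["year"] <= last_training_year]
    test = [r for r in records if r["year"] > last_training_year]
    return train, test


# Hypothetical records and thresholds (not values from the study area).
records = [
    {"year": 2010, "rainfall": 180.0, "groundwater": 2.1,
     "river_level": 5.2, "displacement": 35.0, "landslide": True},
    {"year": 2011, "rainfall": 40.0, "groundwater": 0.4,
     "river_level": 3.0, "displacement": 2.0, "landslide": False},
    {"year": 2012, "rainfall": 150.0, "groundwater": 0.3,
     "river_level": 5.5, "displacement": 12.0, "landslide": True},
]
thresholds = {"rainfall": 100.0, "groundwater": 1.0,
              "river_level": 5.0, "displacement": 10.0}

train, test = split_by_year(records)
train_transactions = [to_transaction(r, thresholds) for r in train]
print(train_transactions)
```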
Given the nature of the Apriori algorithm, to analyze the association rules of landslide monitoring data, the association rules should first be formulated according to the algorithm logic. In this paper, the elements of the antecedent (X) were randomly combined to formulate a series of association rules (see Table 2). The minimum support was set to 0.01 [29], and the confidence of each of these association rules was calculated by the algorithm. On the whole, an association rule with a confidence above 0.7 is considered a strong association rule [17,29,30]. Association rules with low confidence are deleted and only rules with a confidence above 0.7 are retained; from these, a series of useful information is calculated, laying a basis for landslide warning.
Table 2. Description of some association rules (minsupp = 1%).
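The confidence of a rule X => Y is conventionally the support of X and Y together divided by the support of X. The sketch below applies that standard definition and the 0.7 cut-off to a handful of hypothetical transactions; the helper names, the item labels, and the data are illustrative assumptions, not the rules of Table 2.

```python
def support_count(itemset, transactions):
    """Number of transactions containing every item of the item-set."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """confidence(X => Y) = support(X and Y) / support(X)."""
    base = support_count(antecedent, transactions)
    if base == 0:
        return 0.0
    return support_count(antecedent | consequent, transactions) / base


# Hypothetical transactions (not taken from the landslide dataset).
transactions = [
    {"rain_high", "gw_high", "landslide"},
    {"rain_high", "gw_high", "landslide"},
    {"rain_high", "gw_high", "landslide"},
    {"rain_high", "gw_high"},
    {"rain_high"},
    {"gw_high", "landslide"},
]
consequent = {"landslide"}
candidates = [{"rain_high"}, {"gw_high"}, {"rain_high", "gw_high"}]

# Keep only the strong rules, i.e. those with a confidence above 0.7.
strong = [
    (sorted(x), confidence(x, consequent, transactions))
    for x in candidates
    if confidence(x, consequent, transactions) > 0.7
]
print(strong)
```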

Rule Information
The value of confidence in association rules is critical to model prediction. The higher the confidence, the more accurate the landslide prediction will be [29]. Thus, under the same experimental conditions, the confidence values of the IAprioriMR algorithm and the FP-Growth algorithm were compared, and the IAprioriMR algorithm was analyzed in the mid-long-term early warning of landslides. According to the experimental results, under the same conditions, the IAprioriMR algorithm (see Table 3) exhibits a higher confidence for the same rule than the FP-Growth algorithm (see Table 4), so IAprioriMR can achieve more accurate landslide prediction. To verify the accuracy of the experimental results, the test data, in which 21 landslide events were recorded, were taken for analysis. Using the rules with a confidence above 0.7, the IAprioriMR algorithm successfully determines 16 of these landslide events between 2012 and 2013 based on the LS3-1 LS2-1 (see Table 5). This implies that the IAprioriMR algorithm proposed in this paper is feasible in landslide prediction research and can technically support the prevention of landslide disasters.
Table 5. Accuracy evaluation of landslide comprehensive prediction.
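The two counts reported above translate into a detection rate as follows. The variable names are illustrative, and reading the figures as a simple ratio of identified to recorded events is an assumption, since the full contents of Table 5 are not reproduced here.

```python
recorded_events = 21    # landslide events recorded in the 2012-2013 test data
identified_events = 16  # events identified by rules with confidence above 0.7

missed_events = recorded_events - identified_events
detection_rate = identified_events / recorded_events

print(missed_events)
print(f"{detection_rate:.1%}")
```

Under this reading, about three quarters of the recorded events were identified and five were missed.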

Conclusions
During the mid-long-term early warning of landslides, the landslide monitoring data continuously increase over time, and the efficiency of the prediction model significantly affects disaster prevention and control. To solve the problem that the Apriori algorithm generates numerous frequent item-sets when it runs in the MapReduce framework, the IAprioriMR algorithm is proposed, which simplifies the operations on frequent item-sets. Performance test experiments reveal that the computational efficiency advantage of the IAprioriMR algorithm grows as the data volume rises. Under the same conditions, the IAprioriMR algorithm exhibits a higher confidence than the FP-Growth algorithm and is thus capable of achieving more accurate landslide disaster prediction. This implies that the IAprioriMR algorithm has promising applications and good prospects for further development.
In future work, we will consider comparing the performance of the Apriori algorithm with classical intensity-duration (ID) schemes in a case study, in order to check whether it brings significant improvements for landslide prediction in early warning systems.