Spatial Forecast of Landslides in Three Gorges Based On Spatial Data Mining

The Three Gorges is a region with a very high landslide distribution density and a concentrated population. Landslide disasters occur often in the Three Gorges, and the potential risk of landslides is tremendous. In this paper, focusing on the Three Gorges, which has a complicated landform, spatial forecasting of landslides is studied by establishing 20 forecast factors (spectra, texture, vegetation coverage, water level of reservoir, slope structure, engineering rock group, elevation, slope, aspect, etc.). China-Brazil Earth Resources Satellite (Cbers) images were adopted, and a C4.5 decision tree was used to mine spatial forecast criteria for landslides in Guojiaba Town (Zhigui County) in the Three Gorges; based on this knowledge, intelligent spatial landslide forecasts were performed for Guojiaba Town. All landslides lie in the dangerous and unstable regions, so the forecast result is good. The method proposed in the paper is compared with seven other methods: IsoData, K-Means, Mahalanobis Distance, Maximum Likelihood, Minimum Distance, Parallelepiped and Information Content Model. The experimental results show that the method proposed in this paper has a high forecast precision, noticeably higher than that of the other seven methods.


Introduction
The Three Gorges is an area in China in which the distribution density of landslides is very high and the population is very concentrated. In Three Gorges there are frequent landslides and the potential risk of landslide disasters is enormous. After the water has been normally sluiced in the Three Gorges dam, under the effect of factors such as high water levels, water level changes in the reservoir and rainfall, the landslides obviously become more active. It is expected that for a decade or maybe even up to twenty years or more after the Three Gorges Dam has been built, these landslides would be still very active, so if landslides are not studied, spatially forecasted, prevented and remedied, this could have a large and detrimental influence on the normal operation of the dam, navigation buildings and workshops in Three Gorges, which could result in the reservoir in Three Gorges not being able to be normally sluiced and even endanger the properties and lives of the nearby inhabitants.
Extensive attention has been paid internationally to the study of landslide risks. The gestation and occurrence of landslides is greatly influenced by various factors and it is often quite difficult to establish the causality between the occurrence of landslides and those factors. The restriction of the evaluation methods to single factor quantification has a direct influence on the reliability and veracity of the evaluation results, therefore those methods are largely restricted to the area of spatial forecasting of regional landslides. At present the key techniques and methods for effective and scientific spatial forecast of regional landslides are both hot research topics and challenges in the area of landslide forecasting.
In the past 30 years, national and international researchers have conducted many studies on the spatial forecasting of landslide disasters [1][2][3][4]. This research has mostly followed two approaches [1,3]. The first establishes semi-quantitative or quantitative sensitivity indexes for landslide disasters by comparing the distribution maps of landslide disasters with the maps of various factors, then calculates and analyzes the sensitivity indexes and uses the high and the low index values to represent the dangerous and stable regions, respectively. The second theoretically analyzes the relationships between landslides and various influencing factors, assigns weights to the factors by scoring or appraisal, and then calculates the weight coefficients to establish the risk levels for landslides. At present, the major forecasting methods include experimental models (expert scoring), symbolic statistical models (regression analysis, judgement analysis, clustering analysis), information models, fuzzy judgement models (fuzzy synthetical judgement, fuzzy reliability analysis), gray models, pattern recognition models (expert system, neural network), nonlinear models (fractal theory), and so on, of which the Information Content Model is the most mature and the most extensively applied. Along with the rapid development of computer technology and geoinformatics, the combination of GIS (Geographical Information Science) and quantitative spatial forecast methods for landslides has become a new area in geological disaster research.
Yin et al. [5,6] and Yan et al. [7] have carried out in-depth and systematic research on spatial forecasting and the stability partitioning of landslides and slopes, and they have proposed models such as the Information Analysis Model [5], the Multi-factor Regression Model [6], the Clustering Analysis Model [7] and the Judgement Analysis Model [7]. However the precision of these models is not high (about 75-80%). Xie et al. [8] adopted the Information Content Model to make spatial forecasts of landslides in Pangan County, Zhejiang Province, in August 2004 when Typhoon Yunna came to Zhejiang Province, and the forecast precision reached 75.8%. Zhu et al. [9] adopted the Information Content Model to segment landslide risk districts in the upriver regions of the Yangtse, Chongqing, Sichuan Province and Guizhou Province. Based on landslide hazard analysis they utilized the MapGIS software platform to perform a fragility analysis of the regional economy and carry out landslide risk assessments. Shan et al. [10] adopted an Artificial Neural Network and GIS, combined nonlinear theory and spatial multi-analysis, and established a nonlinear forecast model of landslides based on environmental factors in the middle region of Da Yu Island in Hong Kong. International scholars have also researched the relationship between single environmental factors and landslides in a region. Lumb [11] analyzed the correlation among different types of deposits, underlying bed rock and landslides. He analysed the recurrence of slope failures in residual soils of Hong Kong for the period 1950 to 1973 and described various factors contributing to the instability. He postulated that the prime cause of the failures is direct infiltration of rain water into the surface zones of the slopes, producing a loss of effective cohesion following the saturation of the soil, so that prevention of slips implies protection against excessive infiltration.
Ruxton [12] analyzed the relationship between mantle rock and landslides and discovered that weathering had an important influence on landslides. Fourie [13] studied the relationship between rainfall infiltration and shallow landslips. He found that slope failures usually occurred in regions of the world where steep slopes consisting of residual soils were subjected to periods of prolonged and heavy rainfall. A mechanism for the failure of these slopes was postulated whereby in situ soil suctions were decreased by the ingress of a wetting front until a critical depth was reached where the shear strength of the soil was no longer sufficient to ensure stability. He proposed a technique for predicting whether a particular rainfall event (defined in terms of intensity, duration and return period) would cause ingress of a wetting front to this critical depth, with particular reference to the failure of a number of road embankments in the Northern Province of South Africa. Collison [14] studied the relationship between vegetation and slope stability in tropical zones and discovered that, along with the growth of vegetation roots, water permeability would increase and soil strength would decrease. Evans [15] studied the relationship between elevation and rainfall. He found that the underlying geology and the angle of slope were the most important parameters for determining natural terrain landslide susceptibility on a regional scale. Geological strata which appeared to be particularly susceptible included rhyolitic and dacitic lavas, jointed tuffs, layered sequences of volcaniclastic rocks and lavas, and layered sedimentary sequences. The most susceptible slopes were generally those with angles of approximately 35° to 40°. The shape and aspect of a particular slope may also be useful in assessing susceptibility.
Carrara [16] utilized 1,500 landslides in the GIS database and studied the relationship between shallow landslides and landforms, and the results showed that there existed a good statistical correlation between abrupt landforms and landslides. Brabb [17] took slope as the weight, calculated the percentages of 2,000 landslides and 12 factors and concluded geology, soil and slope are the major factors which have an influence on landslide stability.
Traditional research on landslides lacks information extraction and mining of complicated landslide disaster systems and appropriate consideration of the indeterminacy and nonlinear character of landslide systems, and forecasting models based on a single factor cannot produce exact forecasts and estimations of landslide disasters. Furthermore the data come from a variety of sources, and as the means of data accumulation are improved, the contents of the forecast database for landslide disasters have become enormous and more complicated, so the current trend is to determine how to adequately utilize the large amount of available data to realize the spatial forecasting of landslides and increase the precision and effectiveness of forecasts [18][19][20][21][22]. It should be possible to directly drive the development of the theory and technique of landslide risk assessment by utilizing artificial intelligence techniques based on spatial data mining and knowledge discovery. However, at present scholars and researchers primarily use the traditional methods to carry out spatial forecasts of landslides, which need manual intervention, so they are characterized by poor intelligence, low precision and effectiveness. Zhao et al. [23] adopted a decision tree to estimate landslide risk, but he only chose 4 factors of lithology, elevation, slope and spectra, and in his forecast result, many landslides fell in the low risk region and the precision isn't high. Ma et al. [24] adopted a support vector machine to forecast and assess landslide disasters, but his forecast precision is merely 75.45%, and in his results, there were still some landslides lying in stable regions.
In the paper, and considering the Three Gorges, with a complicated landform and frequent landslide disasters, spatial forecast of landslides is studied by establishing 20 forecast factors of spectra, texture, vegetation coverage, water level of reservoir, slope structure, engineering rock group, elevation, slope, aspect, etc. China-Brazil Earth Resources Satellite (Cbers) images, geological maps and terrain maps are adopted based on a C4.5 decision tree to mine spatial forecast criteria for landslides in Guojiaba Town, Zhigui County, in the Three Gorges region and, based on this knowledge perform intelligent spatial forecast of landslides in Guojiaba Town. All landslides lie in the dangerous and unstable regions, so the forecast result is good and the precision is high.

Choice of Forecast Factors
The nature and severity of landslide disasters are related to many factors such as geological structure, stratum, lithology, terrain and physiognomy, vegetation coverage, rainfall, human and engineering activities. In this paper, the characteristics and factors influencing landslides in the Three Gorges region are analyzed; 20 forecast factors are established referring to the aspects of spectra, texture, vegetation coverage, water level of reservoir, slope structure, engineering rock group, elevation, slope and aspect.

Spectral Factors
Remote sensing images can faithfully reflect ground conditions, and adopting remote sensing images to perform quantitative detection and forecasting of landslide disasters has become a new research trend [25][26][27][28]. In this paper Cbers images of 19.5 metres resolution acquired over the Three Gorges in 2006 were adopted to study the spatial forecasting of landslides. The ratio of two spectral bands can effectively suppress interfering information and give prominence to the key information. To suppress vegetation information and give prominence to lithology information, the ratio of the Cbers3 band to the Cbers2 band is calculated. The forecast spectral factors are chosen as Cbers1, Cbers2, Cbers3, Cbers4 and Cbers3/Cbers2.
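The Cbers3/Cbers2 ratio factor can be illustrated with a minimal sketch. The function name `band_ratio`, the NaN handling for zero-valued denominators and the sample digital numbers are assumptions made for illustration; the paper does not state how such pixels were treated.

```python
import numpy as np

def band_ratio(numerator, denominator):
    """Pixel-wise ratio of two bands; pixels with a zero denominator become NaN."""
    num = np.asarray(numerator, dtype=np.float64)
    den = np.asarray(denominator, dtype=np.float64)
    ratio = np.full(num.shape, np.nan)
    # Divide only where the denominator is non-zero; other pixels keep NaN.
    np.divide(num, den, out=ratio, where=den != 0)
    return ratio

# Hypothetical 2x2 digital numbers for Cbers3 and Cbers2 (not taken from the paper).
cbers3 = np.array([[60, 80], [30, 0]])
cbers2 = np.array([[30, 40], [60, 0]])
ratio = band_ratio(cbers3, cbers2)
```

Marking undefined pixels as NaN, instead of zero, keeps them distinguishable from genuinely low ratios when the factor image is later sampled.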

Textural Factors
The form of landslides is connected to the strata and lithology. Texture is a special feature of some strata and lithology and can be directly used to analyze and recognize landslides [29][30][31][32]. The four Cbers spectral images are shown in Figures 1-4, respectively. In Cbers3 the textural information is the clearest and most abundant, so the textural information in the Cbers3 image is chosen to recognize and forecast landslides.

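Homogeneity
$$\mathrm{Homogeneity} = \sum_{i}\sum_{j} \frac{p(i, j)}{1 + (i - j)^2}$$
in which p(i, j) is the value at position (i, j) in the grey-level co-occurrence matrix (GLCM).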
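As an illustration of how a GLCM-based texture measure is obtained, the sketch below builds a normalised co-occurrence matrix for one pixel offset and evaluates homogeneity with the commonly used definition, the sum of p(i, j) / (1 + (i - j)^2). The offset, the symmetric counting, the number of grey levels and the two test images are assumptions; the paper does not report the window size or offset that was used.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Normalised grey-level co-occurrence matrix for the pixel offset (dy, dx)."""
    img = np.asarray(image)
    rows, cols = img.shape
    counts = np.zeros((levels, levels), dtype=np.float64)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[img[r, c], img[r2, c2]] += 1
    if symmetric:
        counts = counts + counts.T  # count each pair in both directions
    return counts / counts.sum()

def homogeneity(p):
    """Sum of p(i, j) / (1 + (i - j)^2) over all GLCM entries."""
    i, j = np.indices(p.shape)
    return float(np.sum(p / (1.0 + (i - j) ** 2)))

# Hypothetical two-level images: a uniform patch and a checkerboard.
uniform = np.array([[1, 1], [1, 1]])
checker = np.array([[0, 1], [1, 0]])
h_uniform = homogeneity(glcm(uniform, levels=2))
h_checker = homogeneity(glcm(checker, levels=2))
```

A uniform patch concentrates all co-occurrences on the GLCM diagonal and so gives the maximum homogeneity, while the checkerboard places them off the diagonal and gives a lower value.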

Vegetation Coverage Factors
In Three Gorges there is flourishing vegetation and on different strata and rocks, the types and degrees of vegetation growth are also different. Landslides often happen in strata with low vegetation coverage. For the regions of high vegetation coverage, the vegetation index is an effective forecast factor. Vegetation index is a quantitative value extracted from a multi-spectral remote sensing image and reflects the vegetation condition on the surface of the earth. Normalized Difference Vegetation Index (NDVI) possesses a wide detection range of vegetation coverage and good adaptability of time phase and space, so NDVI is chosen as the vegetation coverage forecast factor.
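NDVI is conventionally computed as (NIR - Red) / (NIR + Red). The sketch below assumes that Cbers4 is the near-infrared band and Cbers3 the red band, which the text does not state explicitly; the function name and the sample values are likewise hypothetical.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index; NaN where NIR + Red is zero."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    total = nir + red
    out = np.full(total.shape, np.nan)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out

# Hypothetical digital numbers: assumed NIR band (Cbers4) and red band (Cbers3).
cbers4 = np.array([[80, 50, 0]])
cbers3 = np.array([[20, 50, 0]])
index = ndvi(cbers4, cbers3)
```

Values near 1 indicate dense vegetation and values near 0 indicate sparse cover, which is what makes the index usable as a single vegetation coverage factor.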

Geological, Physiognomy and Environmental Factors
According to the 1:50,000 geological map and the 1:10,000 terrain map, six geological structure, physiognomy and environmental forecast factors are established, namely water level of the reservoir, slope structure, engineering rock group, elevation, slope and aspect. The slope structure and engineering rock group factors are provided by the geological map, and the water level of the reservoir factor is provided by the terrain map. The contour line data were plotted from an aerial survey performed in Nov. 2007 and provided by the Three Gorges Headquarters. The factors of elevation, slope and aspect are produced from the contour data. According to the influence of water fluctuation, the water level of the reservoir is classified into four classes: poorly influenced region, secondly influenced region, strongly influenced region and fluctuating region. Slope structure is classified into five classes: slopes with converse direction, slopes with the same direction, slopes with horizontal direction, converse slope and direct slope. Engineering rock group is classified into four classes: soft rock, hard rock, alternate soft and hard stratum, and loose deposits.
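Once the contour data have been interpolated to a regular elevation grid, slope and aspect can be derived from finite differences. The following is a minimal sketch under stated assumptions: row 0 of the grid is the northern edge, columns increase eastward, aspect is the downslope direction measured clockwise from north, and the 3x3 grid is hypothetical. The paper does not describe the interpolation or the software used for this step.

```python
import numpy as np

def slope_aspect(dem, cell_size):
    """Return slope and aspect (both in degrees) for a gridded elevation model.

    Assumes row 0 is the northern edge, columns increase eastward, and
    aspect is the downslope direction measured clockwise from north.
    """
    z = np.asarray(dem, dtype=np.float64)
    d_row, d_col = np.gradient(z, cell_size)
    dz_dx = d_col    # rate of change toward the east
    dz_dy = -d_row   # rate of change toward the north
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    aspect = np.mod(np.degrees(np.arctan2(-dz_dx, -dz_dy)), 360.0)
    aspect = np.where(slope == 0, np.nan, aspect)  # aspect is undefined on flat cells
    return slope, aspect

# Hypothetical 3x3 grid with 10 m cells, rising 10 m per cell toward the east.
dem = np.array([[0, 10, 20]] * 3)
slope, aspect = slope_aspect(dem, 10.0)
```

For this plane the slope is 45 degrees everywhere and the surface faces west, so the aspect is 270 degrees.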

Decision Tree C4.5 Algorithm
The C4.5 algorithm [20], which is based on the ID3 algorithm, adds the ability to translate a decision tree into equivalent production rules and to learn from continuous-valued attributes. C4.5 adopts an information entropy method and chooses the attribute with the maximum information gain rate, together with the corresponding segment threshold, as the best test attribute and segment threshold.

1. Calculating information entropy in classification
Suppose $S$ is the number of samples in the training set and there are $m$ classifications of samples $C_i$ ($i = 1, 2, \ldots, m$). $S_i$ is the number of samples in Classification $C_i$. The computational formula is as follows:
$$I(S_1, S_2, \ldots, S_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
in which $p_i = S_i / S$ is the probability of an arbitrary sample belonging to $C_i$.

2. Calculating information entropy of each attribute
Suppose Attribute $X$ segments the samples into $v$ branches and $S_{ij}$ is the number of samples of Classification $C_i$ in the $j$th branch. The information entropy of Attribute $X$ is:
$$E(X) = \sum_{j=1}^{v} \frac{S_{1j} + \cdots + S_{mj}}{S} \, I(S_{1j}, \ldots, S_{mj})$$

3. Calculating information gain and information gain rate of attribute
The information gain function of Attribute $X$ is:
$$Gain(X) = I(S_1, S_2, \ldots, S_m) - E(X)$$
The information gain function tends to produce a large value for a test that produces many branches. However, a test that produces many branches does not necessarily give a better forecast result for unknown objects. The information gain rate function makes up for this shortcoming of information gain: it is an improvement of information gain which eliminates the bias toward attributes that produce many branches. The information gain rate function considers not only the number of nodes but also the size of each node (the number of samples included) for each segment. What it considers is not the amount of information included in the classifications but that included in each segment. The information gain rate of Attribute $X$ is:
$$A(X) = \frac{Gain(X)}{-\sum_{i=1}^{v} \frac{S_i}{S} \log_2 \frac{S_i}{S}}$$
in which $v$ is the number of branches of the node and $S_i$ is here the number of records for the $i$th branch.

4. Producing decision tree
The algorithm in turn calculates the information gain $Gain(X)$ and the information gain rate $A(X)$ of each attribute and chooses as the test attribute the one which possesses the biggest information gain rate among those whose information gain is not lower than the average of the information gains of all the attributes. It takes the test attribute as a node and each value of the attribute as a branch to segment the samples. If all the samples of a node belong to the same class, the node is a leaf which is marked by its classification. The initial decision tree is formed by recursion, which stops when all the samples of each subset take the same value on the class attribute or there is no attribute left to be utilized.
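The attribute selection rule in the steps above can be sketched for categorical attributes as follows. This is a minimal illustration, not the implementation used in the paper; the function names, the tolerance on the average-gain comparison and the four-sample table are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_and_rate(values, labels):
    """Information gain and information gain rate of one categorical attribute."""
    total = len(labels)
    branches = {}
    for value, label in zip(values, labels):
        branches.setdefault(value, []).append(label)
    expected = sum(len(b) / total * entropy(b) for b in branches.values())
    gain = entropy(labels) - expected
    split = -sum(len(b) / total * math.log2(len(b) / total) for b in branches.values())
    rate = gain / split if split > 0 else 0.0
    return gain, rate

def choose_attribute(table, labels):
    """Highest gain rate among attributes whose gain is at least the average gain."""
    scores = {name: gain_and_rate(column, labels) for name, column in table.items()}
    mean_gain = sum(g for g, _ in scores.values()) / len(scores)
    eligible = {n: s for n, s in scores.items() if s[0] >= mean_gain - 1e-12}
    return max(eligible, key=lambda n: eligible[n][1])

# Hypothetical samples: "sample_id" splits every record into its own branch.
labels = ["landslide", "landslide", "stable", "stable"]
table = {
    "rock_group": ["soft", "soft", "hard", "hard"],
    "sample_id": ["a", "b", "c", "d"],
    "aspect": ["north", "south", "north", "south"],
}
best = choose_attribute(table, labels)
```

Both `rock_group` and `sample_id` reach the same information gain here, but the many-branch `sample_id` attribute has a lower gain rate, so `rock_group` is selected.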

Pruning of Decision Tree
In order to remove anomalous branches introduced by noisy data and isolated points, the initial decision tree should be pruned by a pruning algorithm. Firstly, for each non-leaf node, the expected error rate is calculated for the case in which the sub-tree of the node is clipped. Secondly, the error rate of each branch, combined with the weight of each branch, is used to calculate the expected error rate for the case in which the sub-tree is not clipped. If the expected error rate with the sub-tree clipped is higher than the one with the sub-tree not clipped, the sub-tree is retained. Otherwise the sub-tree is clipped. Finally the decision tree with the smallest expected error rate is obtained.
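The pruning comparison can be sketched as below. The text does not say how the expected error rate is estimated, so a Laplace-corrected estimate is assumed here purely for illustration; C4.5 itself uses a pessimistic confidence-bound estimate. The function names and the class counts are hypothetical.

```python
def laplace_error(class_counts):
    """Laplace-corrected expected error of a leaf labelled with its majority class."""
    n = sum(class_counts)
    k = len(class_counts)
    return (n - max(class_counts) + k - 1) / (n + k)

def should_prune(branch_class_counts):
    """Decide whether a sub-tree should be clipped.

    branch_class_counts holds one list of per-class sample counts per branch.
    """
    k = len(branch_class_counts[0])
    node_counts = [sum(b[c] for b in branch_class_counts) for c in range(k)]
    n = sum(node_counts)
    error_if_clipped = laplace_error(node_counts)
    # Weight each branch's error by the share of samples that reach it.
    error_if_kept = sum(sum(b) / n * laplace_error(b) for b in branch_class_counts)
    return error_if_clipped <= error_if_kept

# Hypothetical nodes: the first split separates the classes well, the second barely.
keep_case = should_prune([[9, 1], [1, 9]])
clip_case = should_prune([[6, 0], [3, 1]])
```

In the first case clipping would raise the expected error from about 0.17 to 0.5, so the sub-tree is retained; in the second the split does not reduce the expected error, so it is clipped.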

Experiments of Criterion Mining and Spatial Intelligent Forecast of Landslides in Three Gorges
In this paper Guojiaba Town, Zhigui County, in the Three Gorges, where landslides happen frequently, was chosen as the research area. Guojiaba Town is composed of the strata of Upper Shaximiao Group J2s, Lower Shaximiao Group J2xs, Niejiashan Group J1-2n, Xiangxi Group J1x, Shazhenxi Group T3-J1s, Ba First Section Badong Group T2b1 and Jia Third Section Jialingjiang Group T1j3. In Guojiaba Town the geological structure is very complicated and there are more than 30 landslides distributed there, for example, Shizibao Landslide, Jinchaiwan Landslide, Zhangjiawan Landslide, North Longwangmiao Landslide, Longtanwan Landslide, Dengjiapo Landslide and South Cement Factory Landslide. A 1:50,000 geology map of Guojiaba Town is shown in Figure 5. Cbers images with 19.5 m resolution produced in 2006 were adopted. They are shown in Figures 6-7. The overlay of the remote sensing image on the disaster distribution map is shown in Figure 8. The primarily adopted forecast factor images are shown in Figures 9-17.
In total 2,696 sample points were chosen to produce a decision tree of 167 leaf nodes, with a training precision of 99.5%. The decision tree after pruning includes 136 leaf nodes and has a training precision of 99.1%, as shown in Table 1. The mined criteria of spatial forecast of landslides in Guojiaba Town are shown in Table 2. The rules possess high confidence values and the precision of rule extraction is 99.3%. The meanings of the values of some forecast indexes are shown in Table 3. The spatial forecast of landslides in Guojiaba Town, Zhigui County, in the Three Gorges was performed by applying the mined knowledge, and the forecast result is compared with the ones obtained by seven methods: Information Content Model, IsoData Method, K-Means Method, Mahalanobis Distance, Maximum Likelihood, Minimum Distance and Parallelepiped. The forecast precision of the decision tree method is 99.15%, obviously superior to that of the other seven.
The forecast results and precisions of the various methods are shown in Figures 18-25 and Table 4, respectively.
The experimental results show that the IsoData Method and the K-Means Method define all the pixels as a dangerous region, and the Parallelepiped Method divides most of the pixels into dangerous and unstable regions, which is obviously inappropriate and gives a low forecast precision. The Minimum Distance Method is not good at separating dangerous, unstable and stable regions, and its forecast precision is also low. The Information Content Model erroneously assigns the basically stable area to the north of Guojiaba Town to the unstable region. The Maximum Likelihood Method and the Mahalanobis Distance Method can distinguish the four kinds of regions (dangerous, unstable, basically stable and stable), but they still make mistakes in some areas and their precisions are average. The Decision Tree Method does well in the spatial forecast of landslides and obtains a very high forecast precision.
The forecast result of the Decision Tree Method is overlaid on the disaster distribution map, as shown in Figure 26. The overlay shows that all the landslides in Guojiaba Town lie in dangerous and unstable regions and confirms that the Decision Tree Method provides a good forecast result and a very high forecast precision, which is obviously superior to that of the other seven methods.
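The forecast precision and the Kappa Coefficient used to compare the methods can both be derived from a confusion matrix, with Kappa defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below uses hypothetical pixel counts, not the paper's data, for the four region classes; rows are taken as reference classes and columns as forecast classes.

```python
import numpy as np

def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's Kappa from a square confusion matrix."""
    m = np.asarray(confusion, dtype=np.float64)
    total = m.sum()
    observed = float(np.trace(m) / total)
    expected = float(np.sum(m.sum(axis=0) * m.sum(axis=1)) / total ** 2)
    if expected == 1.0:
        return observed, 0.0  # degenerate case: only one class present
    return observed, (observed - expected) / (1.0 - expected)

# Hypothetical counts: every pixel is forecast as the first class (dangerous).
all_dangerous = [[16, 0, 0, 0],
                 [30, 0, 0, 0],
                 [24, 0, 0, 0],
                 [30, 0, 0, 0]]
acc_one_class, kappa_one_class = accuracy_and_kappa(all_dangerous)

# Hypothetical counts: every pixel is forecast correctly.
perfect = np.diag([25, 25, 25, 25])
acc_perfect, kappa_perfect = accuracy_and_kappa(perfect)
```

A method that assigns every pixel to a single class agrees with the reference only as often as chance would, so its Kappa is 0 even though its accuracy equals the share of that class; this is consistent with the behaviour described for the IsoData and K-Means methods.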

Conclusions
In Three Gorges there are extensively distributed active landslides and the population is very concentrated, so the potential landslide risk is tremendous. To guarantee the normal running of the dam, buildings for navigation and workshops and to safeguard the properties and lives of the inhabitants in Three Gorges, the landslides should be exactly predicted and forecasted, so it is very important and significant to make accurate spatial forecast of landslides in the Three Gorges region.
The formation of landslides is related to many factors such as geological structure, strata, lithology, terrain and physiognomy, vegetation coverage, rainfall, and human and engineering activities. By studying the mechanisms of landslides in the Three Gorges, 20 forecast factors are established covering the various aspects of spectra, texture, vegetation coverage, water level of reservoir, slope structure, engineering rock group, elevation, slope and aspect. Those factors are closely related to the mechanism and formation of landslides in the Three Gorges.
China-Brazil Earth Resources Satellite (Cbers) image with the resolution of 19.5 meters can factually reflect ground circumstances and does well in monitoring and forecasting regional landslides. Cbers images produced in 2006 were adopted to spatially forecast landslides in Three Gorges.
The Decision Tree Method is an attractive data mining method and does well in pattern recognition and classification. The C4.5 algorithm adopts an information entropy method and chooses the attribute with the maximum information gain rate, together with the corresponding segment threshold, as the best test attribute and segment threshold. In this paper Guojiaba Town (Zhigui County) in the Three Gorges region, in which there are more than 30 landslides, was chosen as the study area. The C4.5 algorithm was adopted to produce a pruned decision tree with a training precision of 99.1% and to mine the spatial forecast criteria of landslides in Guojiaba Town. The mined criteria possess high confidence values and the precision of criteria extraction is 99.3%.
Based on the mined criteria, a knowledge-driven intelligent spatial forecast of landslides in Guojiaba Town is realized, with a high forecast precision (99.15%) and Kappa Coefficient (0.9876). All landslides lie in the dangerous and unstable regions, so the forecast result is good. Seven other methods were also adopted to make spatial forecasts of landslides in Guojiaba Town: IsoData, K-Means, Mahalanobis Distance, Maximum Likelihood, Minimum Distance, Parallelepiped and Information Content Model. The forecast precisions of the first six of these methods are 15.99%, 15.99%, 73.44%, 80.97%, 28.21% and 46.59%, respectively, with Kappa Coefficients of 0, 0, 0.6311, 0.7322, 0.0906 and 0.2126; the precisions of all the methods are given in Table 4. With these seven methods some landslides are even classified into stable regions, or stable and basically stable regions are mistakenly recognized as dangerous or unstable ones. The method proposed in the paper can realize accurate spatial forecast of landslides and is obviously superior to the other seven methods tested. Furthermore, the mined forecast criteria are quantitative, so they can provide intelligent spatial forecast and interpretation of landslides in the important Three Gorges region.