Data Mining and Statistical Approaches in Debris-Flow Susceptibility Modelling Using Airborne LiDAR Data

Cameron Highland is a popular tourist hub in the mountainous area of Peninsular Malaysia. Most communities in this area suffer frequent incidence of debris flow, especially during monsoon seasons. Despite the loss of lives and properties recorded annually from debris flow, most studies in the region concentrate on landslides and flood susceptibilities. In this study, debris-flow susceptibility prediction was carried out using two data mining techniques; Multivariate Adaptive Regression Splines (MARS) and Support Vector Regression (SVR) models. The existing inventory of debris-flow events (640 points) were selected for training 70% (448) and validation 30% (192). Twelve conditioning factors namely; elevation, plan-curvature, slope angle, total curvature, slope aspect, Stream Transport Index (STI), profile curvature, roughness index, Stream Catchment Area (SCA), Stream Power Index (SPI), Topographic Wetness Index (TWI) and Topographic Position Index (TPI) were selected from Light Detection and Ranging (LiDAR)-derived Digital Elevation Model (DEM) data. Multi-collinearity was checked using Information Factor, Cramer’s V, and Gini Index to identify the relative importance of conditioning factors. The susceptibility models were produced and categorized into five classes; not-susceptible, low, moderate, high and very-high classes. Models performances were evaluated using success and prediction rates where the area under the curve (AUC) showed a higher performance of MARS (93% and 83%) over SVR (76% and 72%). The result of this study will be important in contingency hazards and risks management plans to reduce the loss of lives and properties in the area.


Introduction
Rapid population growth and its increasing concentration in urban areas have worsened the severity and impact of gravity-induced disasters across the globe, particularly in the mountainous environment. Debris flow is a natural hazard that occurs in a valley or on a mountain slope destroying

Research Background
Recently, laser scanning technology, often known as Light Detection And Ranging (LiDAR) systems, has become widely accepted as a breakthrough in remote sensing surveying methods since 1999 [62]. This system is used to identify and detect attributes of terrain surface. The architectural and structural development of LiDAR system operations for prosperous data acquisition have been reported in the literature [43,45,[62][63][64][65][66][67]. Based on the platforms, three types of laser scanning technologies have Sensors 2019, 19, 3451 4 of 32 been reported, namely; Airborne Laser Scanning (ALS), Terrestrial Laser Scanning (TLS) and Mobile Laser Scanning (MLS). Their components and mode of operation in data capturing differs; therefore, in this research we considered ALS system because it allows wider area coverage than TLS and MLS. All LiDAR data yields favorable fine-resolution geomorphometric and topographic data in 3D that can hold significantly to an accurate surface representation with a very-high spatial-resolution over a wide area [68]. The LiDAR data reveals micro-morphological features, which are significant for initiating debris flow even under dense vegetation canopies and demonstrates capability in observing earth movements. The proficiency of the LiDAR system has made it valuable in detecting and mapping some catastrophes related to earth movements. The tropical climate of Malaysia makes debris flow analysis a difficult task. Tree canopies, dense forests and rugged terrains constitute a barrier to derive complex morphological and topographic features from microwave remote sensing data [40,44,46,[69][70][71]. Unlike Synthetic Aperture Radar (SAR) data derived elevation model, LiDAR data is not affected by topographic shadowing because of the angle of inclination (<20 • off-nadir) of LiDAR sensors [11,44,45,48,64,72,73]. In a recent study, a LiDAR-derived Digital Elevation Model (DEM) was considered the most favorab le terrain data used to derive topographic attributes integrated with appropriate modelling approach to detect and predict environmental disaster [67,74]. LiDAR-derived DEM provides fine spatial-resolution topographic data, which accurately depicts the terrain configuration [70,75,76].
In some papers, different types of models have been compared in various types of geoscience application. For example SVM and random space in [78,99,103], ANN-fuzzy logic approach [48,61,81,98,104,105]; Decision tree and Neural Networks [77,92,98,105,106] have been applied elsewhere. Reports have shown that some of these approaches out-performed the conventional techniques [49,107,108]. Although every modelling approach has its related merits and demerits which underpin their guiding assumptions, individual studies using MLA for debris flow susceptibility modelling are limited, especially in the case of the newest algorithms in the field of geo-hazard such MARSpline. In the instance of SVMs, parameterization needs or operability values have not been thoroughly investigated; although the few identified approaches are usually adopted for hazard mapping especially landslide susceptibility. Conversely, to our best knowledge, no existing research have compared the two approaches using LiDAR-derived DEM to assess and predict area susceptible to debris flow.
However, to the best of our knowledge in the literature search, very few studies have employed machine learning approach in debris flow susceptibility mapping; Some good examples are the combination of weight and extension method [109] and Analytical Hierarchy Process (AHP) [109][110][111][112]. Some of the studies on debris flow susceptibility in Malaysia [113] have adopted Horton open-source Flow-R software for the identification of debris-flow source area. However, the Horton (Flo-R) method lacks flexibility in the selection of the conditioning factors used and was developed for a regional evaluation purpose [33,114,115].
MARSpline technique is a non-parametric MLA and non-linear estimation procedure. This approach adopts discretization regression gradients at diverse interludes data frame for model assessments [116]. The MARS is best at identifying optimal variable connections and transformations; Sensors 2019, 19, 3451 5 of 32 expose intricate hidden data models successfully, often on a high scale. Also, the model has revealed significant data relations and pattern, that are hard or even awkwardly attained by other approaches [117][118][119]. The basic advantage of MARS approach compare to other MLAs is its ability to simplify the challenges of a nonlinear relationship between dependent and explanatory variables. As an alternative, MARS structures a relation from established basis functions (BF) and coefficients, which are generally determined from the regression information (www.statsoft.com). Certainly, the areas in which this model have been applied include landslides [11,31,42,82,83,90,116,[120][121][122][123][124][125][126]. The superiority of MARSpline over some MLAs have been demonstrated in other fields of research [81,125,[127][128][129][130].
Meanwhile, the second model lies in the principle of SVM (R-SVM) approach; established on statistical learning concept. The model was adopted due to the intricacies of procedures in geosciences where the relationships among different data types are usually non-linear. Therefore, it was is recommended to employ R-SVM in a situation of mild nonlinearity; sometimes validation and reduction in data dimension can be handled in the SVM algorithm. The approach processing or runs time is smaller compared to other data-mining algorithms (neural network). Various researchers have adopted this MLAs separately [42,90,99,122].

Materials and Methods
This study is conducted in Cameron Highlands, around the Ringlet area of Pahang in Peninsular Malaysia. Geographically, the study area lies within latitude 4 • 22 52" N and 4 • 25 48" N and longitude 101 • 22 30" E to 101 • 23 30" E ( Figure 1) covering an area of about 25 km 2 . This area is specifically chosen because of the frequent occurrences of landslides and slope instability that often results in debris-flow [47,79]. In terms of physical development, the study area experiences rapid developments, which have greatly altered the natural slope, potentially, putting the community living in downslope in danger. Generally, the topography of the Cameron Highlands area is underlain by rising and falling plateau ranging between the elevation of 1000 m and 1676 m a.s.l. [47]. The geological formation comprises of granites of various categories of megacrystic biotite and associated micro-granite (muscovite) [48,55,131] and meta-sediments comprising of slate, (Sedimentary rock) phyllite, limestone, sandstone and schist (metamorphic rock) [61]. The mean annual rainfall records range from 2500 to 3000 mm per annum. There are two raining seasons in the study area, which include; the September to December regime and the February to May regime but the peaks are recorded between March to May and November to December respectively. During the peak period, rainfall of between 88 mm to 100 mm in a single day usually triggers landslides [6,11,50,52,55]. When the local rainfall rate exceeds a distinctive intensity threshold, a runoff-generated debris flow may likely be initiated.
The occurrence of debris-flow events in this region of complex terrain is strongly induced by ubiquitous slopes-bed deposits of weathered-source grain sediments and weather system. Such as recurrent monsoon in the wet season is associated with thunderstorms that result in a locally intensive downpour; the recorded rainfall intensity is sufficient to trigger debris flow [17,20,26]. The risk presented by debris flow is dynamic and the threat related to a hydrological hazard could aggravate severe hazard in the near future due to modifying climate pattern in the tropical region that could change the precipitation system.

Data, Software and Techniques
The systematic procedures utilized to achieve the objectives of this research paper is summarized and illustrated in Figure 2. The datasets used in this study are ALS data, orthophoto and landslide-inventory data. The site was visited twice in 2017, first between 15th and 18th May and later from 19th to 23rd December. Twelve conditioning factors comprising of topographic and geomorphic parameters were derived from 5 m resolution LiDAR-derived DEM to model debris flow event using SVM and MAR Splines algorithms. The inventory data was assigned to the dependent variables while the conditioning factors represent the independent variables. Data processing was accomplished using different software packages these include Whitebox (open source software; https://www.whiteboxsoft.com), ArcGIS 10.6 (ESRI, Kuala Lumpur, Malaysia) software environment, LAStool in ArcGIS, Salford predictive Modeler 8.1 (Salford System, San Diego, CA USA), Statistica (StatSoft,Tulsa, OK, USA) and Waikato Environment for Knowledge Analysis (WEKA) version 3.9 (University of Waikato, Halmilton, New-Zealand).

LiDAR Data Acquisition and Preparation
Airborne LiDAR data was acquired with discrete pulse return systems over Cameron Highlands in support to a regional scale assessment of landslides [11,52,132]. The LiDAR database comprises of pulses (beam), gathered in the scope of PBRC Project, 10th Malaysia Plan (RMKe-10), through Minerals and Geoscience Department Malaysia (JMG) in the year 2015. The LiDAR data was captured on the 15 January 2015 by a Malaysia-based Company, using a scanner aboard EC-120 Helicopter flown at an average altitude of 1676 m above m.s.l. The sensor was calibrated to sample at a 25,000 Hz pulse rate that yields ground point density of 8 points/m 2 , 0.5 point spacing with a vertical accuracy of 15 cm on vegetated terrain and horizontal accuracy of 25 cm, fall within the acceptable threshold value for LiDAR processes in the area of land formations [133]. The acquisition exercise, footage stored a number of pulse-returns together with the first and last returns. In addition, an orthophoto image (0.09 × 0.09 m) was obtained during the LiDAR campaign, (Figure 1) in the scanned area using EOS5D Mark III camera with a focal length of 35 mm mounted on the aircraft. The camera has a horizontal and vertical resolution of 72 dpi, respectively, and exposure time of 1/2500 s.

Data, Software and Techniques
The systematic procedures utilized to achieve the objectives of this research paper is summarized and illustrated in Figure 2. The datasets used in this study are ALS data, orthophoto and landslide-inventory data. The site was visited twice in 2017, first between 15th and 18th May and later from 19th to 23rd December. Twelve conditioning factors comprising of topographic and geomorphic parameters were derived from 5 m resolution LiDAR-derived DEM to model debris flow event using SVM and MAR Splines algorithms. The inventory data was assigned to the dependent variables while the conditioning factors represent the independent variables. Data processing was accomplished using different software packages these include Whitebox (open source software; https://www.whiteboxsoft.com), ArcGIS 10.6 (ESRI, Kuala Lumpur, Malaysia) software environment, LAStool in ArcGIS, Salford predictive Modeler 8.1 (Salford System, San Diego, CA USA), Statistica (StatSoft,Tulsa, OK, USA) and Waikato Environment for Knowledge Analysis (WEKA) version 3.9 (University of Waikato, Halmilton, New-Zealand).  The data pre-processing task commenced with a subset of the area of interest that covers a span of (5 km × 5 km) 25 km 2 . LAStool of ArcGIS software package was utilized to filter out the other returns (vegetation and building) owed to their insignificance to this study. Keeping single LiDAR point cloud records size representing the ground surface, which is of interest to this research. First and last pulse return that made-up point cloud using morphological filtered approach to separate ground points from non-ground points, the latter represented the surface terrain morphology of the area and is selected for further analysis. The statistical reports have shown the LiDAR total point return is 3,998,752 where 2,848,914 (71.25%) was classified ground and 1,149,838 (28.75%) non-ground using LAStools add-in in ArcGIS software. The filtered ground LiDAR return points cloud dataset was interpolated into raster grids using a natural neighbor algorithm (NNA) via Triangulated

LiDAR Data Acquisition and Preparation
Airborne LiDAR data was acquired with discrete pulse return systems over Cameron Highlands in support to a regional scale assessment of landslides [11,52,132] cm on vegetated terrain and horizontal accuracy of 25 cm, fall within the acceptable threshold value for LiDAR processes in the area of land formations [133]. The acquisition exercise, footage stored a number of pulse-returns together with the first and last returns. In addition, an orthophoto image (0.09 × 0.09 m) was obtained during the LiDAR campaign, (Figure 1) in the scanned area using EOS5D Mark III camera with a focal length of 35 mm mounted on the aircraft. The camera has a horizontal and vertical resolution of 72 dpi, respectively, and exposure time of 1/2500 s.
The data pre-processing task commenced with a subset of the area of interest that covers a span of (5 km × 5 km) 25 km 2 . LAStool of ArcGIS software package was utilized to filter out the other returns (vegetation and building) owed to their insignificance to this study. Keeping single LiDAR point cloud records size representing the ground surface, which is of interest to this research. First and last pulse return that made-up point cloud using morphological filtered approach to separate ground points from non-ground points, the latter represented the surface terrain morphology of the area and is selected for further analysis. The statistical reports have shown the LiDAR total point return is 3,998,752 where 2,848,914 (71.25%) was classified ground and 1,149,838 (28.75%) non-ground using LAStools add-in in ArcGIS software. The filtered ground LiDAR return points cloud dataset was interpolated into raster grids using a natural neighbor algorithm (NNA) via Triangulated Irregular Network (TIN) to produce a digital terrain model (DTM) of 5 m cell size utilizing the ArcGIS 10.6 software tool. Natural neighbor presented a ubiquitously smooth surface, but the position sample differs. This was chosen because of Baker and Coop [134], who stated that no technique of interpolation is superior to another when used for the interpolation of a DTM; thus they observed the model conservative and has the lowest error term. NNA adapted the principle of autocorrelation to the added inverse distance to a power function and related weights to the input samples in proportional areas to interpolate a value. Hence, the LiDAR data was used to produce a DEM of the study area at 5 m resolution. The LiDAR-derived DEM derived defines elevation of surface in term of planes, valleys and crests.
Obviously, no ideal grid resolution exists, but rather a range of suitable resolution [133] despite its significant role-plays in terrain efficiency mapping. Selection can be optimized to a certain level to satisfy both processing capabilities and representation of spatial variability. A grid resolution that optimally represents diverse terrain surface is required that is capable to reflect the majority of geomorphic attributes. A finer grid resolution can introduce local artefacts and slow down computation of terrain parameters. Kienzle [135,136] pointed a systematic overview of the effects of various grid resolutions on the reliability of terrain parameters suggesting fine grid resolutions from 5 m to 10 m. A LiDAR-derived DEM was generated at 5 m cell size signifying the LiDAR post spacing [135,137], exhibit hydromorphic structures used to generate the debris-flow conditioning factors.
Very high-resolution offers the prospect of the greater reliability of topographic features, which improves the identification of debris flow. The topographic reflectance dataset was utilized to generate primary and secondary terrain surface derivatives. Nevertheless, it could produce inherent issues related to the surface; generate anisotropic effects in the spatial resolution, generating artefacts. Pits in the DEM were filled to sieve out undulating artefacts to attain hydrological connectivity [138]. For computational efficiency, lessen prediction errors and increase model accuracy, the original DEM 0.5 m resolution data was spatially resampled to 5 m subsequently. According to Keijsers et al. [139], fine DEM resolution represents geomorphic and topographic configurations better than coarse DEM. Research has demonstrated the effects of different DEM resolutions on susceptibility mapping which indicates low DEMs resolution hinders true representation of terrain attributes because it smoothens local surface variability [139]. Based on the assumption that terrain conditions that initiate an event in the past will likely lead to a reoccurrence of such event in future, the LiDAR DEM was used to derive both the primary and secondary conditioning factors utilized in this investigation.
Statistical analysis of the debris flow inventory data and its conditioning factors were combined to formulate an independent input variable utilized to predict area susceptible to debris flow. This allows assessment of areas prone to debris flow for emergency planning. Researchers [70,71,77,85,99,107,115,[140][141][142][143][144][145][146][147] have identified the relationships between earth movements and the conditioning factors associated with morphological, topographical, climatological, hydrological and human activities. However, the accuracy of debris-flow susceptibility mapping depends on the high quality of these input model factors data.
Unlike land use and climate variables which change rapidly over a short time, lithology and soil properties were all not considered as part of the conditioning factors in this study because they are assumed to be constant over a longer time scale [139]. In addition, rainfall data was exempted from the model because it is considerably uniform [47,148] over the small area.

Derivation of Conditioning Factors
From, the thorough review of relevant literatures e.g., [38,60,115,[149][150][151][152][153][154][155] (Figure 3l). Generally, these variables are critical factors that control the occurrence and characteristics of mass movement and are widely adopted in geo-hazard modelling and assessments e.g., [7,23,38,39,42,48,81,89,156]. In addition to the parameters discussed above, a number of morphometric related indices were also calculated and included in the modelling. First is the Topographic Wetness Index (TWI) which involves the upslope contributing area, a slope raster and a number of geometric functions [171,172] . It is important to consider carefully the flow accumulation and slope because how they are calculated will strongly affect the TWI values. The value of each cell in the output raster is the value for the flow accumulation raster for the corresponding DEM. TWI is mathematically expressed as [90]: where ln is the natural logarithm, As is the flow accumulation and is the slope angle. Stream Power Index (SPI) as a hydrological factor is a product of catchment area and slope. SPI is used to identify suitable locations for soil conservation measures to reduce the effect of concentrated surface runoff [17,20,87]. It has a considerable impact on debris flow development which indicates the potential of a channel to cause erosion [77]. SPI is generated using Equation (2): Sediment Transport Index (STI) accounts for the effect of topography on erosion. The 2D catchment area is used instead of the 1D slope length factor as in the Universal Soil Loss Equation: (e) (f)  The elevation factor was extracted from LiDAR-derived DEM, which describes the general configuration of topography. It provides height value for each point on the surface relative to the reference datum, in this case, the mean sea level. Area of low altitude collects surface runoff rapidly from the slope angle, which allows debris deposition. High elevation initiated debris, flow on varying slopes and angle. The elevation was reclassified into five classes using the natural break classification scheme e.g., [85,87,98,110,122].
Burrough and McDonnell [157,158] recommended Horn slope algorithm to be more appropriate to estimate slope angle utilizing remotely sensed LiDAR-derived DEMs on rugged terrain; most importantly this algorithm is embedded in ArcGIS software (ESRI). Hence, slope angle was determined using the surface analysis tool of spatial analyst extension in ArcGIS 10.6 software environment, using elevation as input data. This factor is a measure of a tangential surface robustly potential proxy for debris flow occurrence. Thus, slope gradient or angle is one of the core conditioning factors in debris flow mapping and modelling [11,115,153,159,160] and terrain morphology. Geomorphometric applications are effectively determined on the local slope, calculated in the direction of undue rapid gradient from discrete pixel to its diagonal or adjoining grid cells. Different activities take place at different slope range, for example, at slope angles between 15 • to 35 • , debris flow initiation occurs [27,102,113,[160][161][162][163][164] whereas run-out and deposition mechanism occurs at a slope between 15 • to 20 • and <9 • , respectively [12].
Another slope related parameter is the slope aspect. Slope aspect was computed with surface analysis tool via spatial analyst tool of ArcGIS 10.6 software environment. The output was converted from radians to degrees [165], the values range from 359 • to 1 • representing a cardinality (direction) of the surface. While total curvature depicts the convex or concave terrain of neighboring grids, aspect describes the direction, which reveals a surrogate for exposure to the climatic factors. This shows the horizontal direction of the slope of a mountain orientation [77], which activates weathered and eroded materials to be transported down the slope in the study area. As earlier discussed, most of the past debris-flow hazards in the area were triggered by landslides (new or old) combined with heavy downpour (duration and intensity) during the monsoon seasons, with a defined direction of wind flow [166,167]. For example, during the monsoon season, the wind flows in Southeastern and Northeast directions [166], noteworthy to mention that the NE monsoon wind direction and movement bring out heavy rainfall while the SE monsoon is a dry wind. This heavy downpour triggers flooding and landslides. The slope aspect was expressed in a unit degree (0 • -360 • ), oriented clockwise directions starting from the north position, The facet of the terrain defines the effects of weathering and the amount of rainfall in the area based on the level of the exposed surface [77].
TPI is a degree of association between the elevations of a principal position to the elevation of adjacent terrain (nearest neighbor), describing planes trench, and crests. TPI was estimated following the methods of [168,169] focal statistic tool was used with radius neighbourhood shape, maximum and minimum kernel window cell size (3 × 3) and (200 × 200) [165].
The flow accumulation parameter defined the drainage pattern of the study area. The parameter presents the cumulative amount of the number of pixels that logically drain into outlets [38]. The collective hydrologic flow value signifies the number of input pixels that contribute to the channel. To avoid misleading result, sinks in the DEM was filled before it is used to produce flow direction and subsequently flow accumulation.
Some debris-flow conditioning factors (Total, profile and plan curvatures) showed a positive relationship with flow accumulation. Profile curvature was estimated using the slope simplification approach revised by [170]. The profile curvature defines two distinct characteristics of the topography, convexity (positive) and concavity (negative) curvature of the terrain surface [77,142]. The concave curvature accumulate water and exhibit convergent of soil and hold deposit sometimes indicating positive pixel value on an image. While convex type of curvature does not hold any earth material and indicated by negative values and a flat surface with a zero value.
In addition to the parameters discussed above, a number of morphometric related indices were also calculated and included in the modelling. First is the Topographic Wetness Index (TWI) which involves the upslope contributing area, a slope raster and a number of geometric functions [171,172]. It is important to consider carefully the flow accumulation and slope because how they are calculated will strongly affect the TWI values. The value of each cell in the output raster is the value for the flow accumulation raster for the corresponding DEM. TWI is mathematically expressed as [90]: where ln is the natural logarithm, As is the flow accumulation and β is the slope angle. Stream Power Index (SPI) as a hydrological factor is a product of catchment area and slope. SPI is used to identify suitable locations for soil conservation measures to reduce the effect of concentrated surface runoff [17,20,87]. It has a considerable impact on debris flow development which indicates the potential of a channel to cause erosion [77]. SPI is generated using Equation (2): Sediment Transport Index (STI) accounts for the effect of topography on erosion. The 2D catchment area is used instead of the 1D slope length factor as in the Universal Soil Loss Equation: where Z i,n , represent the standardized value for Z i , which signifies separate data point, Z min and Z max are the minimum and maximum in the datasets.
In data pre-preprocessing stage, we normalized individual datasets which transform the factor values to a common standard dimensional level. Because of the large variability observed in the range values of the processed conditioning factors. Each conditioning factors was normalized using (Equation (4)) that produced a pixel value ranging between 0 and 1, purposely to unify the conditioning factors, reduce high dimensionality differences in the modelling task [38], also allowing relative estimate and combination. Parameter values standardization can improve the model response, action that tends to stabilize variance, remove nonlinearity, and counter non-normality (SAS tutorial, 2019) and leads to better model fits.

Inventory Data for Model Training and Validation
Inventory data is vital to debris flow susceptibility analysis [6]. Based on existing historical records, 320-point locations of previous landslides/debris-flow events were obtained. The validity of the points was verified during the site visit using handheld GPS (GERMIN GPSMAP 78S) and the information provided by the local residents. Another 320 points representing non-debris-flow locations were randomly selected and combined with the inventory data for the modelling. The 640 points were labelled 1 s and 0 s for the classification process where the former indicates the presence of debris flow (in the past) and the latter represent the absence of debris flow [55]. For each point location, corresponding cell values of the normalized conditioning factor datasets were extracted using multi-values to point tools in ArcGIS 10.6 software and randomly split into training and test datasets using ratio 70% (448) to 30% (192) to develop and validate the resulting debris flow susceptibility model [38,55,78,104,111,112,115,154,[173][174][175]. As earlier mentioned, MARS and SVR data mining techniques were adopted and these MLAs often accept binary classifications [107].

Debris Flow Susceptibility Modeling
Data pre-processing carried out to select relevant factors that have no influence on others but have strong relationships with the conditioning factors using the correlation approach and explores the factors accuracy metrics information value (IV) and Gini coefficient. Machine learning approaches of SVM and MARS are evaluators considered in debris flow susceptibility mapping.

Debris Flow Conditioning Factor Selection
Several debris-flow conditioning factors have been experimented in many studies as input into data-mining algorithms. However, the selection process of the parameters is mostly not given proper consideration [176]. The inclusion of input parameters into modelling algorithms using the expert or a priori knowledge of the phenomenon could affect the quality of the model output [101,176]. The quality of a model's output does not necessarily depend on the number of input parameters. Studies have established that the use of a huge number of factors practically increases the cost of data collection and processing time in addition to the possibility of generating misleading regression estimate [6]. Therefore, testing the credibility of input parameters by optimizing the conditioning factors is essential for reliable debris flow susceptibility modelling.
A very useful concept of variable selection in debris flow susceptibility model was done by optimizing information value (IV), utilizing Gini and Crammer values [42] under spearman correlation to examine the multi-collinearity effects in the predictor (input variables). This step allows identifying and subsequently discarding unrelated and out of range dataset, that usually affects the predictive accuracy of the model. It has been reported that a good feature subset is one that contains features highly correlated with the target, yet uncorrelated with each other [177]. IV (Equation (5)) is a statistical method for feature prediction that removes insignificant factors at 95% confidence level. The predictive power is expressed as such; <0.02 (bad predictor), 0.02 to 0.1 (weak predictor), 0.1 to 0.3 (moderate predictor), 0.3 to 0.5 (strong predictor) and >0.5 (Excellent predictor), which can also be expressed in percentage (%). Usually, variable all within moderate and strong predictive power are selected for model building. Optimum conditioning factors were selected using the IV, mathematically expressed as: where k is the class size of individual conditioning factors, g i represent the distribution of good and b i denote the distribution of bad is the column-wise in percentage distribution of the total "debris flow" circumstances in the ith case, b i is the percentage distribution of the total "no-debris flow" circumstances. The Gramer's V statistic and Gini coefficient were calculated on individual conditioning factor. Gini coefficient is a measure of inequality in a given population and has an index value ranges from 0 (all factors are equal) to 1 (strength of affinity in an infinite variable). In addition, the Cramer's V quantify the relationship amongst the conditioning factors and also, its value range from 0 (no relationship) to 1 (perfect relationship). The Cramer's V is expressed as [92,116]: where x 2 is the Chi-square, n is the number of entries existing in the dataset, w is the size of the categorical dependent variable (i.e., "non-debris flow" and "debris flow" (=2) and k is the size of factors present in the explanatory variables (i.e., the 12 debris flow conditioning factors).
Since the conditioning factors are non-parametric datasets, we adopted multiple regression using Spearman correlation method to further evaluate the relationships between the predictor and the conditioning factors (i.e., check for multicollinearity effects) purposely to indicate the level of insignificant relationship between the conditioning factors. Finally, all the conditioning factors that meet this hypothesis were selected and included in the modeling. Some of the input factors with a risen noise negatively contribute to the model building.

Multivariate Adaptive Regression Splines (MARS)
Multivariate Adaptive Splines (MARS) first put forward by [178] as one of the novel artificial intelligent approaches. MARS is a flexible, swift and accurate model for predicting continuous binary effects [116]. This approach integrates the mathematical structure of splines and the classical linear regression. The binary recursive splitting and instinctive search algorithm make it a model which is capable of predicting values from a given target variable (continuous and categorical) obtain from a dataset (independent variables). MARS employs basis function (BF) (Figure 3), a non-parametric and non-linear procedure that adopts discretization regression gradients at diverse intervals of the data frame for estimations. De-Andres et al. [116] noted that the recursive segregating approach that controls generalized additive modelling (GAM), classification and regression tree (CART) jointly motivates the MARS technique. The elasticity of the MARS method makes it possible to model correlations that embrace collaboration with limited parameters [116]. MARS is best at identifying optimal variable connections and transformations that expose intricate hidden data models successfully. It is also the capability of revealing a pattern that is rarely possible for other approaches to achieve [117].
The principal gain of the MARS model is that it clarifies the difficult in a nonlinear relationship between dependent and independent variables [126]. To do this, it approaches the underlying function through a set of piecewise functions, basic functions (BF), also known as Splines Polynomials [128,130]. The BF represents the information contained in one or more independent variables and was selected through a systematic process. The MARS model f (x) is developed as a linear combination of BFs and the connections between them. MARS model is express in (Equation (7)) as: where y is the model expected value through a function f (x) which could be disintegrated into an initial constant β 0 , β m is the coefficient of the mth BF, h m (x) is a BF, and n is the number of BFs in the model [88,179]. BFs are functions that proceed in either twofold modified formula as [90,122]: where x is an independent (0 factors) variable and α is a constant, represent the node-to-node position. MARS utilizes truncated power functions as spline basis functions [180]. The Dual contiguous splines overlap at the knot to preserve the occurrence of the BF [122,127].
To generate the final MARS model, first, the forward stepwise algorithm (Equation (7)) was employed where a number of successive pairs of BFs of the model are added to obtain optimal performance [128,178]. This algorithm introduces complex over-fittings, which usually produce a poor estimate of the predicted results [116]. So, a second backward stepwise algorithm was applied to resolve the over-fitting by removing redundant BFs from the MARS model using Generalized Cross-Validation (GCV) (Equation (9)) [134]. The GCV limits the number of BFs to be included in the performance [128,130,179,[181][182][183]: where N is the number of dataf (x i ) is the projected value of the MARS model and C(B) is a complexity penalty that rises with the number of basis functions (BF) in the model. C(B) is obtained from Equation (10): where d, a penalty is for each BFs included in the model as a smoothing factor, and B is the number of basis functions in Equation (7). The penalizing parameter (d) signifies a cost BF optimization [128]. Increasing the number of d influences the size of the knotss while assigning higher values for d allows the use of less number of knots, smoother function estimates and small BFs [116]. The optimal value for the penalizing factor is between 2 ≤ d ≤ 4; however, Friedman [178] recommends 3. In the current work, a value of 3 is assigned to the penalty factor d similar to Zhang et al. [128]. In addition, the maximum interaction BFs level is limited to 10 for MARS approach with the optimal model size M * of 2 [130].
The MARS model was also used in the optimal feature selection process [26,81]. The contribution of each conditioning factors in the debris flow model was assessed in ANOVA. This procedure involves all the BFs related to both the individual factor and pairwise interaction aggregated separately through analysis of variance decomposition for statistical significance [178,183]. This was achieved by evaluating the coefficient of determination R 2 , the coefficient of correlation r, mean average error (MAE), root mean squared error (RMSE), performance index ρ and receiver operative curve (ROC). Finally, the performance rate and index are compared with SVM.

Support vector machines (SVM) is a supervised machine learning technique developed by Vladimir
Vapnik and his colleagues in 1995 at AT&T Bell Laboratory [184]. SVM was originally developed for classification (C-SVM) and later adopted to solve regression (R-SVM) problems called support vector regression (R-SVM) [185,186]. The algorithm is based on statistical learning theory and structural risk minimization capable of addressing linear and non-linear multivariate regression problem [187]. SVM separates two opposite analogous hyperplanes and get the best separating distance between two vectors [106,127]. If the space between them is large, then the overall error of the classifier is reduced. The relationships among different categories of data in geosciences are mostly nonlinear. Therefore, it is recommended to adopt C-SVM also in nonlinear problems. Details on the SVM algorithm can be found in [102,105,188]. Mathematically, R-SVM is expressed as: where, w is the coefficient vector that defines the hyperplane positioning in the feature space, b is the offset of the hyperplane from the source, and ξ i is the positive loose parameters [189,190]. To achieve optimal hyperplane, an optimization problem is resolved using the following mathematical equation [189,190]: where α i is the Lagrange multiplier, and C is the model penalty. Thus, the decision function is demonstrated as follows: In an instance where the separating sample is nonlinear, the decision function is otherwise written as: In this study, the initial data was converted to higher dimensional space by means of kernel functions (K). A number of kernel functions are available in SVM (polynomial, linear, sigmoid and radial basis) [190,191]. Prominently and most widely adopted kernel for mass movement applications, the radial basis function (RBF-Equation (15)) [78,101], is used in this study: where δ 2 is the bandwidth of the RBF used to adjust the generalization, x i is the i-th support vector, and x j is the j-th support vector. The hyperplane function can be described as: where ω and b are the coefficient. The critical limitation conditions over an original optimization problem based on the SVR model explained as follows:

Model Validation and Evaluation
To evaluate the debris flow model performance, we used Receiver Operating Characteristics (ROC) sensitivity, specificity and accuracy evaluation techniques [6,104,125,192]. Area Under the Curve (AUC) values were used to compare the success and prediction rates [11,149,193,194]. ROC curve assesses the model's capability to differentiate between the presence and absence of even represented in the specificity versus sensitivity measures (Figure 4). The ROC curve plots the false positive rate on the X-axis and true positive rate on the Y-axis. A good model has indicated with ROC curve fitted upper-left side and AUC value close to 1. Ordinal association between the model's success rate is adjudged to be poor (0.5 to 0.6), average (0.6 to 0.7), good (0.7 to 0.8), very good (0.8 to 0.9) or excellent (0.9 to 1.0) [92,195].
Additionally, global performances of the model were evaluated to ascertain the power of the model in terms of prediction and capability strengths presented as success rate and prediction rate respectively; using the ROC/AUC (Equation (9)) [6,11,52,99,107,192,196]. The ROC curve is a dual scale plotted graph of Equation (18) against Equation (19). The curves are plotted on a Cartesian graph, calibrated on both positive axes (y-axis and x-axis) or vertical and horizontal lines with a scale units range from 0 to 1. It has been reported that the closer the curve toward the upper left of the vertical axis, the superior the model performance [99]. The quality of a model prediction or classification is demonstrated in a quantitative AUC value of categories scale as follows: Speci f icity = TN TP + TN where TP is the True positive, TN denote the True negative.

Results
In this study, twelve conditioning factors (Table 1) were prepared and optimized to predict area susceptibility to debris flow in Cameron Highlands. The multi-collinearity effects amongst the twelve conditioning factors and important contribution in the debris flow modelling using MARS rank are shown in Table 1. The Information Value (IV) indicated that profile curvature has an insignificant influence of about 28% on the other variables meaning that, it has no collinearity effect and strongly agreed by Cramer's values (26%) and Gini coefficient (7%). The second level is the slope aspect with 40% IV, 30% Cramer's Value and 23% Gini coefficient and is the most important variable in debris flow prediction with 100% MARS score function. RSPI was observed to have the highest multi-collinearity effect under IV of 94%. This is evidently shown on other evaluation metrics, which considered as a bad predictor and discarded from the advance analysis. The Cramer's V relationship score reveals no significant correlation between each variable, which coincided with [176] this translate to no multi-collinearity effect present in the variables at 95% confidence level. In Gini coefficient, SCA, RSPI, TWI, STI and Profile Curvature demonstrated the presence of multicollinearity effects; this shows a certain level of disagreement, especially with Profile curvature. To satisfy the requirement for feature subset selection that is highly correlated with the target, there is a need for further evaluation on the training data [176].
Further analysis on the contribution of each variable through an optimization process with MARS algorithm indicates that only seven factors significantly influence the outcome. The measure of relative importance ranked in percentage is presented in (Figure 4). The most contributed or important conditioning factor in order of affinity is the slope aspect (100%) scored value and the least is STI (57%). Slope aspect contributes most to debris flow susceptibility with a measure of 100%; possibly because of the monsoon airstream bearing on the topography, which tends to make the high-slope side much more prone to debris flow initiation. Closely following the slope aspect is profile curvature, total curvature, slope angle and plan curvature with 97.41%, 86.82%, 84.46% and 77.66%, respectively. The other relative important variables are topographic wetness index and sediment transport index with 56.71% and 56.56% respectively. Similar feature subset selection was established by Jakob and Horton et al. [71,115].
The remaining five (TRI, elevation, RSPI, SCA and TPI) variables yielded zero or no contribution to the model building despite their theoretical contribution in mass movements modelling.

Debris Flow Susceptibility Mapping with MARS Approach
In this research, 15 BFs and knot locations generated and shown in Table 2, BF used to describe individual interval of the conditioning factors. The debris flow susceptibility map generated by MARS model was produced using Equation (22) and rendered in Figure 5. This is executed in ArcGIS

Debris Flow Susceptibility Mapping with MARS Approach
In this research, 15 BFs and knot locations generated and shown in Table 2, BF used to describe individual interval of the conditioning factors. The debris flow susceptibility map generated by MARS model was produced using Equation (21) and rendered in Figure 5. This is executed in ArcGIS 10.6 software environment, MARSpline Salford Predictive Modeler (SPM) 8.1 software (https://www.salfordsystems.com); that generated the susceptibility map as shown in Figure 5. Table 2. BFs and knot locations.

No. Relationship
The relationships between the explanatory variables and the contribution of the target variables are demonstrated on the response surface curves (Figure 6a-h). The figure explains the interaction between a pair of variables. Other contributions of the slope aspect, STI in describing the spatial distribution of debris flow shown in Figure 6c,d,f-h.
The seven optimal conditioning factors subset (Table 1) were utilized as input variables into MARS and R-SVM algorithm that produces the debris flow susceptibility model and generates a regression equation (Equation (20)) for the final debris flow susceptibility map ( Figure 5). MARS model outputs susceptibility index value range is from 0.005 to 0.9999. This was further classified into five levels, which include Not Susceptible, Low Susceptible, Moderate Susceptible, Susceptible and Highly Susceptible using equal interval classification method ( Figure 5).
The seven optimal conditioning factors subset (Table 1) were utilized as input variables into MARS and R-SVM algorithm that produces the debris flow susceptibility model and generates a regression equation (Equation (20)) for the final debris flow susceptibility map ( Figure 5). MARS model outputs susceptibility index value range is from 0.005 to 0.9999. This was further classified into five levels, which include Not Susceptible, Low Susceptible, Moderate Susceptible, Susceptible and Highly Susceptible using equal interval classification method ( Figure 5).

Model Performance Evaluation
The general quantitatively produced performance accuracies of the debris flow model based on the ROC curve approach is shown in Figure 7a,b success rate and prediction rate by equating the two susceptibility maps with the debris flow training and validation datasets respectively. From the result, we observed that the MARS and R-SVM algorithms showed promising performance for debris flow susceptibility evaluation AUC between 0.72 and 0.92. The plot of the ROC curve Figure 7 and analysis of the indicators provided a quantitative appraisal of the model performance.

Model Performance Evaluation
The general quantitatively produced performance accuracies of the debris flow model based on the ROC curve approach is shown in Figure 7a,b success rate and prediction rate by equating the two susceptibility maps with the debris flow training and validation datasets respectively. From the result, we observed that the MARS and R-SVM algorithms showed promising performance for debris flow susceptibility evaluation AUC between 0.72 and 0.92. The plot of the ROC curve Figure 7 and analysis of the indicators provided a quantitative appraisal of the model performance.  Performance evaluation of the two models using ROC curve indices shows that the MARS model transcended R-SVM model with AUC values of 0.9345 and 0.7587 respectively. Using the validation dataset, the model prediction rate produced AUC values of 0.831 and 0.721 for MARS and R-SVM models respectively. MARS have the highest success rate and prediction rates 0.935 and 0.832 respectively. The R-SVM model has the lowest measure of accuracy with a success rate of 0.7587 and prediction rate of 0.721.

Discussion
Studies such as [11,55,115,160,[197][198][199] have confirmed the significance of slope aspect, profile curve, total curve and slope gradient to geo-hazard modelling. They observed the significant

Discussion
Studies such as [11,55,115,160,[197][198][199] have confirmed the significance of slope aspect, profile curve, total curve and slope gradient to geo-hazard modelling. They observed the significant contribution of these conditioning factors (slope angle, plan curvature and flow accumulation) to debris flow initiation. In this study, STI, Roughness, Elevation, SPI, SCA and TPI were eliminated from advance analysis because they have zero scores and therefore lack predictive ability. Delmonaco et al. [200] pointed out that sediment availability, water input and slope gradient are the three main factors that directly or indirectly affect the possibility of debris flow initiation. Plan curvature, TWI and STI on the other hand, reveal the proportion of soil water content associated with the soil characteristics and denudations [198,201]. Nevertheless, the TWI is related more to debris flows than to landslides in low-order channels [124,162,202]. A high TWI value suggests the presence of surface depressions that prevent surface runoff and store surface water [48]. The factors excluded during the optimization process (elevation, SPI, surface roughness, SCA and TPI) reduces the objective function of the model (minimizing the error term). The excluded conditioning factors were considered redundancy was not involved in the prepared final modelling input independent variables; however, it does not mean they are not useful in the development of debris flow processes elsewhere. For instance, several studies have highlighted the role of elevation, SCA and SPI in debris flow and landslides in different environments [38,89,90,94,203].
The response curves relationship identified the contribution of profile curvature and TWI in debris flow potential modelling, this could be responsible for the hydrological effect on the debris flow catchment area. Thus, refers to an explanatory factor of fluid accumulation at a catchment area and signpost the tendency for debris flow propagation or run-out downslope due to the influence of gravitational forces [77]. Water permeation effects on soil strength divulge and reactivate the flow of debris from slope curvature of water to the lower values of profile curvature, which contributes most. From Figure 6e, slope gradient contributes intensively to the debris flow initiation to deposition at an angle range of 15 • to 35 • [11,115,173,204] and deposition zones (≤9 • ).
The quantitative susceptibility indices range from 0.005 to 0.9999 and 0.005 to 0.776 for MARS and R-SVM respectively. A pixel with 0 value indicates no possibility of debris flow while a pixel value of 1 indicates absolute debris flow of 100% possibility. The possible values need to be split into a specific number of classes that construct qualitative susceptibility maps to debris flow [105]. A meaningful decision can be derived from the maps via a nominal representation of the results. These were subjected to five qualitative classes the using natural break scheme, due to its adherence to a specific framework that manipulated the class into non-user defined categories [77]. The catchments were categorized into five areas of susceptibility levels; Not Susceptible, Low Susceptible, Moderate Susceptible, Susceptible and Highly Susceptible. The area in red colour denotes highly debris flow susceptibility. In contrast, visual susceptibility maps showed a glaring conflicting pixel class in Figure 5b, misclassification from highly susceptible class to moderate susceptible class. Drainages and second-order stream were predicted to be highly susceptible area to debris flow as shown in Figure 5a,b; for the purpose of mitigation measure, this zone is strongly not suitable for habitation, hence land-use developers should avoid it.
The model performance reveals MARS approach performing better than R-SVM [120,[127][128][129]. The closer the ROC plot is to the upper left corner of the graph, the higher the overall accuracy [79,205,206]. The ROC curves are plotted on a Cartesian graph, calibrated on both positive axes (y-axis and x-axis) or on vertical and horizontal lines with a scale units range from 0 to 1. It has been reported that the closer the curve tends toward the upper left of the vertical axis of the graphs, the superior the model performance [99]. MARS success rate is toward the top right axis, which can term excellent with average prediction rate.
The competence of simple LiDAR-derived conditioning factor that offers promising and consistent debris flow susceptibility analysis have been demonstrated. This has similarly been proved in flood modelling [77]. However, the model's efficiencies confirmed the significance of a feature subset selection and variables evaluation at the pre-processing stage of modelling. Although, exploration of advance feature subset selection and other machine learning algorithms could unveil different and better results.

Conclusions
This research demonstrated the use of laser scanning technology with MARS and R-SVM to develop models for the prediction of an area susceptible to debris flow, in the Ringlet region of the Cameron Highlands, which is affected by recurrent mass-movement related natural disasters. Triggered by anthropogenic and natural factors; utilized appropriate samples of training and test datasets to examine the interplays between the input factors (conditioning factors) and the dependent variable (debris flow inventory) for debris-flow susceptibility mapping. Geomorphometric and topographic elements generated from LiDAR-DEM (slope aspect, profile curvature, slope angle, total curvature plane curvature, TWI and STI) are observable factors that most contributed to debris flow prediction and susceptibility zones mapping in the Cameron Highlands. These parameters can be deemed suitable for a similar form of debris flow studies with minimal processing time and a high accuracy output.
The debris flow susceptibility maps revealed proficiencies of MARS and R-SVM algorithms with more than 72% AUC accuracy values, which yielded acceptable performance based on the success and prediction rates indicators. Interestingly, the MARS technique performed better with an accuracy difference of 2% in success rate and 1% prediction rates higher than R-SVM; this confirmed the efficacy of MARS application in debris flow prediction and detection of an area susceptible to debris flow. The pixel values generated from the debris flow susceptibility maps were classified into five using Jenk's natural break classification scheme. The debris flow zone induced by the monsoon rain that twig by landslides will be greater than the events in the absence of landslides and the elevation of the debris source is developed toward the valley or mountain bottom.
Based on the accomplishments of the models used, this study also highlighted the significant contribution of MARS approach and LiDAR-derived DEM, suggesting for adoption in an area that has similar geographical settings. LiDAR provides high-resolution topographical data that represents the land surface more effectively. Therefore, the extracted topographic parameters from the laser scanning data permit realistic representation of the terrain configuration for debris flow in somewhat more reliable in susceptibility mappings.
Future investigation will explore the transferability of this model in a different environment using similar temporal LiDAR dataset on diverse topographic characteristics, which can improve model accuracy. In addition, advancement in feature subset selection is needed especially introducing metaheuristic nature-inspired optimization algorithm with Unmanned Aerial Vehicle (UAV) data may produce better results in future.
The outcome of this research paper will contribute to identifying the area susceptible to debris flow for the purpose of significant regional planning, land use and policies making processes. Such strategies can assist the government and decision-makers to reduce the risks associated with debris-flow disaster.