Partial Least Squares Regression for Determining the Control Factors for Runoff and Suspended Sediment Yield during Rainfall Events

Multivariate statistics are commonly used to identify the factors that control the dynamics of runoff or sediment yields during hydrological processes. However, one issue with the use of conventional statistical methods to address relationships between variables and runoff or sediment yield is multicollinearity. The main objectives of this study were to apply a method for effectively identifying runoff and sediment control factors during hydrological processes and apply that method to a case study. The method combines the clustering approach and partial least squares regression (PLSR) models. The case study was conducted in a mountainous watershed in the Three Gorges Area. A total of 29 flood events in three hydrological years in areas with different land uses were obtained. In total, fourteen related variables were separated from hydrographs using the classical hydrograph separation method. Twenty-nine rainfall events were classified into two rainfall regimes (heavy Rainfall Regime I and moderate Rainfall Regime II) based on rainfall characteristics and K-means clustering. Four separate PLSR models were constructed to identify the main variables that control runoff and sediment yield for the two rainfall regimes. For Rainfall Regime I, the dominant first-order factors affecting the changes in sediment yield in our study were all of the four rainfall-related variables, flood peak discharge, maximum flood 3926 suspended sediment concentration, runoff, and the percentages of forest and farmland. For Rainfall Regime II, antecedent condition-related variables have more effects on both runoff and sediment yield than in Rainfall Regime I. The results suggest that the different control factors of the two rainfall regimes are determined by the rainfall characteristics and thus different runoff mechanisms.


Introduction
Soil erosion affects soil productivity and sustainable agriculture.Erosion by water strips the fertile topsoil on site, degrades water quality, and clogs streams, rivers, and reservoirs by transporting sediments off site [1].Suspended sediment yields represent the sum of the erosion produced by all active sources within a watershed [2].Analysis of the relationships between sediment transport, rainfall, and runoff characteristics can facilitate the elucidation of the factors and processes determining sediment responses [3].
Many recent studies have evaluated factors that control hydrological and sediment responses at an inter-event scale.Oeurng et al. [4] use a Pearson correlation matrix and factorial analysis to assess the relationships between precipitation, discharge, and suspended sediment transport to explain the hydrological and sedimentological responses in a catchment of France.López-Tarazón et al. [5] use multiple regression equations derived from Pearson correlation analysis to describe the relationships between rainfall, runoff, and sediment transport in a mountainous catchment.Wine et al. [6] apply stepwise regression to determine control factors of runoff in watersheds of the Southern Great Plains.When analyzing the relationship between rainfall, runoff, and sediment yield, multivariate statistics are commonly used to relate control factors to the dynamics of discharge and sediment yield [2,[7][8][9].
These studies enable soil and water conservationists to understand the complexities of hydrological processes.However, these statistical approaches present particular analytical challenges despite their great potential [10].Many control factors are highly correlated, which can result in redundancy.Thus, the application of these statistical approaches is somewhat limited f inappropriate approaches or unrepresentative variables are selected.Canonical correlation requires that the ratio of the number of predictors to the sample size be at least 0.025-0.05[10].Ordinary regression is hindered by limitations imposed by sample size (the number of observations).Classical multiple regression requires a large sample size relative to the number of predictors [11].The limitations of traditional multivariate regression approaches in handling multi-collinear and noisy data can be overcome by applying techniques based on multivariate statistical projection [12].Therefore, the influence of changes in each of the hydrologic variables on runoff and sediment yield must be investigated to enable more effective and accurate watershed management and prediction of the hydrological consequences of rainfall.
Knowledge of sediment yield from small watersheds is critical to understand the linkage between soil erosion processes on hill slopes and suspended sediment transport in large rivers [13].In this study, data on sediment yield for a small watershed in the Three Gorges Area (TGA) were collected.The Three Gorges Project (TGP) on the Yangtze River in China is the world's largest hydropower complex project.Following the construction of the Three Gorges Dam (1994Dam ( -2009)), millions of farmers resettled in the surrounding mountain areas and cultivated marginal lands, which are largely on steep slopes with soil of poor structure.The TGA, which refers to the riparian parts along the Yangtze valley between Yichang and Chongqing (Figure 1), contains a substantial amount of arable land on steep slopes known to be susceptible to soil erosion.Sediment yield in this area is estimated to be approximately 700 t•km −2 •year −1 [14].Soil erosion is a serious issue in this area because of long-term anthropogenic pressure, including over use and inappropriate development [15].The objectives of this paper are: (1) to quantify the contribution of control variables to runoff and sediment yield using partial least squares regression (PLSR); and (2) to investigate the effects of land use change on runoff and suspended sediment at an inter-event scale on a mountainous watershed in the TGA.

Study Area
Integrated small watershed management (ISWM) for soil conservation has developed rapidly in the TGA since the 1990s.ISWM has been conducted in more than 5000 small watersheds with an area of 96,000 km 2 , and the central government has invested 15.2 billion RMB (about 2.5 billion USD) in this project [16].As a part of the ISWM program, the Wangjiaqiao watershed was selected as a monitoring site.A national Gauging station on the outlet of the watershed was constructed in 1989.Gauging records of discharge and sediment have provided useful information for decision makers and planners since the 1990s.
The Wangjiaqiao watershed lies in Zigui County of Hubei Province, China (31°5′ N-31°9′ N, 110°40′ E-110°43′ E).The watershed is approximately 50 km northwest of the Three Gorges Dam and covers an area of 1670 ha (Figure 1).Elevations within the watershed range from 184 to 1180 m; slopes range from 2° to 58°, with an average of 23°.Two main soil great groups occur in the study watershed, namely, purple soil derived from purple sandy shale and paddy soil developed from the purple soil.According to the Soil Taxonomy of the USDA, the purple soil and paddy soil are classified into Entisols and Aquepts, respectively.The climate is subtropical, with mean temperatures between 11 and 18 °C.Annual precipitation averages 1016 mm, of which 70% occurs between May and September.Previous studies about this watershed can see Fang et al. [17] and Fang et al. [18].

Field Surveys and Land Use
Field surveys were conducted in 1995, 2000, and 2005.The watershed topographic map (scale 1:10,000) was used in combination with 1995 and 1999 aerial photographs and 2005 SPOT5 imagery.The land use types were delineated on the photographs and verified in the field.In this watershed, land use is mainly a function of elevation and topography.The remnant forest patches exist primarily on steep, inaccessible peaks and slopes.Little natural vegetation is observed, and most areas are covered by secondary vegetation under human influence.The main agricultural crops are rice (Oryza sativa L.), maize (Zea mays L.), and wheat (Triticum aestivum L.).The streams in the Wangjiaqiao watershed have a trellis drainage pattern, and the length of the main channel is approximately 6500 m.Due to the implementation of ISWM for soil conservation in the TGA in the 1990s, land use was altered between 1995 and 2005 in the Wangjiaqiao watershed.Table 1 provides the areas of various types of land use and the corresponding percentages.In 1995, forest covered 48.7% of the study area, whereas farmland covered 43.1%.The other land use types were relatively minor and consisted of shrub land (3.2%), rural residential land (4.3%), and water bodies (0.7%).During the 1995-2005 periods, some steep lands with slope gradients of more than 25° were converted to forest.This change was related to the implementation of ISWM for soil conservation in the TGA in the 1990s.During this period, forest increased to 56.9% in 2000 and to 66.3% in 2005, whereas farmland decreased to 34.6% and 24.1%, respectively (Figure 2).

Field Monitoring
A set of instruments consisting of a continuously recording rain gauge, water-level stage recorder, and silt samplers (bottle type) were used to record rainfall, stream flow, and sediment flow, respectively.The water stage was measured every 15 min and then transformed into discharge via the calibrated rating curve obtained through periodic flow measurements.Suspended sediment concentrations (SSCs) were determined by the gravimetric method.Water samples were vacuum filtered through a 0.45 μm filter, and the residue was oven dried at 105 °C for 24 h.The weight of each dried residue and the sample volume were used to determine the SSC (g•m −3 ).The suspended sediment yield (TL) was then calculated from the SSC and water discharge (Q) data.Watershed runoff and rainfall data have been collected since 1989.

Data Processes
In this study, runoff was separated between storm-flow and base-flow, using the classical hydrograph separation method [19].Despite its arbitrary nature (similar to all other methods), this hydrograph separation method was used in this study to characterize the response of the catchment to a rainstorm, and no interpretation in terms of runoff processes was derived from the separation [20].Equipment malfunctions prevented complete monitoring of all storms on several occasions.The land pattern of the study watershed varied continuously during the 1990-2005 period.Hydrograph separation was conducted on 29 events occurring in 1995, 2000, and 2004; for these events, we had reasonably completed records of sediment concentrations and conducted reconnaissance field surveys.The inner-event rainfall data for 2005 were missing, so we used data of 2004 instead of 2005.Floods were identified when the increase in stream discharge exceeded 1.5 times the base flow recorded at the beginning of the rainfall event [21].For each rainfall-runoff event, the characteristics of individual storms were evaluated based on their erosive characteristics.The flood events were subsequently characterized using three groups of variables (Table 2).
Some variables use the follow equations to calculate: where R, TQ, BF, and Dq are the runoff, total discharge, base flow, and duration of discharge respectively.
where TLi, SSCi and Qi are the suspended sediment yield, maximum flood suspended sediment concentration, and discharge, respectively, during period i.

Clustering Approach and Partial Least Squares Regression (PLSR)
Runoff and sediment generation varies considerably depending on rainfall type.Many studies have suggested that local storm patterns are important for determining runoff and sediment yield [22,23].Such rain parameters as depth, duration, and intensity play a key role in inducing various water erosion rates [24,25].Thus, prior to PLSR analysis, we used a clustering approach to distinguish rainfall regimes.Clustering approach was evaluated with the SPSS13.0statistical software package.
Partial least squares regression is a robust multivariate regression method that allows users to perform a wide range of analyses [26].Partial least squares regression provides a quick overview of the main systematic types of variation in data from complex systems and helps to identify mistakes in the input data.Applied as a multivariate calibration of one dependent variable vs. many independent variables, PLSR is suitable for selectivity enhancements of analytical instruments [26].PLSR is a method for relating two data matrices, X and Y, by a linear multivariate model, but goes beyond traditional regression in that it models also the structure of X and Y. PLSR derives its usefulness from its ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y [27].Details on the theory, principles, and application of PLSR can be found in the literature [28].In this study, PLSR was performed with SIMCA-P (Umetrics AB, Umeå, Sweden).Four separate PLSR models were constructed to identify the main variables that control runoff and sediment yield for the two rainfall regimes.For runoff models of two rainfall regimes, runoff (R) is considered to be dependent variable and the other variables in Table 3 are considered to be independent variables.As similar, for sediment load models, total suspended sediment load (TL) are considered to be dependent variable.Table 3.General characteristics of the analyzed rainfall-runoff events recorded in the Wangjiaqiao watershed.

Characteristics of the Flood Events
Table 3 summarizes the general characteristics of the rainfall, runoff, suspended sediment, and antecedent conditions associated with the observed floods and variables as analyzed by statistical analysis.
The maximum amount of precipitation for a single event was 72.7 mm (during the event on 3 June 2004); most of the events were relatively small in magnitude.Only 10 events (34%) were greater than the average rainfall value (27.8 mm), whereas the remaining events were below the average.The mean intensity varied from 0.63 to 10.45 mm•h −1 , and eight events were greater than the mean value (2.74 mm•h −1 ).The maximum 30 min intensity ranged from 1.3 to 30 mm; 24% exceeded 10 mm during the 30 min interval, and eight of the events were greater than the average value (6.89 mm).The antecedent rainfall values varied considerably, ranging from 0 to 24.6 mm, 0 to 69.2 mm, and 0.1 to 89.0 mm of precipitation during the 1-, 5-, and 10-day previous periods, respectively.The runoff generated by rainfall varied between 3674 and 309,618 m 3 , with a mean value of 100,459 m 3 .Peak discharge oscillated between 0.05 and 5.59 m 3 •s −1 .The peak was greater than the mean value (2.20 m 3 •s −1 ) during 16 floods (55% of the total sample).The baseflow level fluctuated from 0.006 to 0.857 m 3 •s −1 .The maximum flood sediment concentrations varied from 130 to 18,200 g•m −3 ; with a mean concentration of 1740 g•m −3 .The total suspended sediment load carried by 15 floods exceeded 167 t (representing a specific suspended sediment yield of 10•t•km −2 ).The maximum yield during a single flood occurred on 4 June 2004 and reached 412,651 kg.This yield was generated by 72.7 mm of precipitation, which created a flood peak discharge of 5.44 m 3 •s −1 .These values illustrate the degree of geomorphic activity of the system and confirm the high sediment contribution and transport capacity of the channels in the Wangjiaqiao watershed, which are largely related to the availability of fine materials in the TGA areas and the accumulation of these materials along the main channel [15].
We generated a Pearson correlation matrix (Figure 3).The linear correlation coefficients among rainfall-, runoff-discharge-, sediment-, and antecedent condition-related variables are described in detail.Peak flow (Qmax) was significantly correlated with R, TQ, SSCmax, TL, and P10D.The strongest correlation was between precipitation (TQ) and runoff (Qmax).Runoff was significantly correlated with TQ, SSCmax, P1D, P5D, P10D, and BF.The results confirmed that many variables were co-linear.

Results of Clustering Approach
The 29 rainfall events were divided into two groups using K-means clustering.Three rainfall variables were used during this process: the depth (P), duration (D), and maximum 30 min rainfall intensity (I30) (Table 4).The general characteristics of the two rainfall regimes can be described as heavy and moderate rainfall (p < 0.0001).Compared with Rainfall Regime II (17 events), Rainfall Regime I (12 events) was composed of rainfall events with a high mean P and D and low I30.

Results of PLSR Analysis
Many studies have demonstrated that land use type can change runoff and sediment yield [29]; thus, the percentages of forest and farmland during the study years were included in the PLSR models.In a PLSR model, the importance of a predictor for both the independent and dependent variables is given by the variable importance for the projection (VIP) [10,12].Terms with large VIP values are the most relevant for explaining the dependent variable.To overcome the problem of over-fitting, the appropriate number of components of each PLSR model was determined by cross-validation to achieve an optimal balance between the explained variation in the response (R 2 ) and the predictive ability of the model (goodness of prediction: Q 2 ) [10].
Table 5 provides a summary of the four PLSR models constructed separately for the runoff and sediment yield of the two rainfall regimes.For the runoff model of Rainfall Regime I, the prediction error decreased with an increasing number of components, and the minimum RMSECV and maximum Q 2 were obtained with five components.An additional increase in the number of components generated a higher prediction error, suggesting that the other components were not strongly correlated with the residuals of the predicted variable [12].The first component explained 71.6% of the variance in the dataset in terms of the changes in runoff (Table 5).The addition of the second through fifth components to the models cumulatively explained 99.5% of the total variance.For the TL model of heavy rainfall, the maximum Q 2 was obtained with two components.The first component explained 80.6% of the variance in the dataset in terms of the changes in TL.The second component explained 16.2% of the variance.These two components explained 96.8% of the total variance.Further addition of components to the PLSR models did not substantially increase the percentage of the variance explained (Table 5).The PLSR weights could be used to describe the quantitative relationship between the predictors and results because they are linear combinations of the original variables that defined the scores [28].
For the runoff model of moderate rainfall, the optimum model had two components, with a maximum Q 2 of 0.646.The model explained 89.7% of the total variance, with 69.6% of the variance explained by the first component.The optimum model for the TL of moderate rainfall also contained two components.Those two components explained 88.6% of the total variance.The maximum Q 2 was 0.656.
The first component of the runoff model for heavy rainfall (Table 6) was dominated by P, Im, Qmax, TQ, SSCmax, and TL, whereas the second component was dominated by Qmax and TQ.The third, fourth, and fifth components were dominated by many variables, mainly on the negative side.A more convenient and comprehensive expression of the relative importance of the predictors was obtained by exploring their VIP values [12].For runoff in Rainfall Regime I, higher VIP values were obtained for changes in TQ, Qmax, P, Im, SSCmax, and TL (VIP > 1), followed by the percentage of forest and farmland (0.936 and 0.941) (Figure 4).Predictors with VIP values below one are considered of minor predictive importance.For runoff in Rainfall Regime II, a higher VIP value was obtained for changes in TL, Qmax, BF, P5D, and TQ.Compared with Rainfall Regime I, runoff in Rainfall Regime II was more likely to be affected by antecedent conditions, such as BF and P5D.For the TL model of Rainfall Regime I, higher VIP values were obtained for changes in P, I30, Im, Qmax, SSCmax, R, percentage of forest and farmland, and D (see Figure 5).Thus, all variables have important effects on TL with the exception of antecedent conditions and TQ.For the TL model of Rainfall Regime II, high VIP values were obtained for P, Qmax, TQ, R, and BF.Table 7 indicates that Qmax was important in the first and second components of Rainfall Regimes I and II, whereas P5D and P10D had only minor effects on TL for either rainfall regime.

Control Factors for Runoff
The hydrological response of a watershed to a rainfall event is determined by several interacting factors that control runoff generation [30].In this study, only sediment flux data from the outlet of the watershed were available, and thus, the within-watershed pattern of erosion and deposition remains uncertain.Soil moisture content is reflected in the antecedent conditions with BF, P1D, P5D and P10D.
The results indicated that TQ, Qmax, and TL are important for runoff in Rainfall Regimes I and II.The rainfall characteristics (P and Im) are more important for runoff in Rainfall Regime I, whereas soil moisture content is more important for Rainfall Regime II.The former pattern is in agreement with the results of Oeurng et al. [29].In Regime I, TQ and Qmax dominate runoff, and base flow has only a minor effect due to its small magnitude.The effect of soil moisture is greater in Rainfall Regime II than in Rainfall Regime I.The sensitivity of the runoff response to soil moisture depends on the predominant runoff mechanisms [31].Surface runoff generated by saturation-excess flow is driven from spatially and temporally dynamic variable source areas and requires lower rainfall intensities for its initiation.Both infiltration excess and saturation excess runoff processes may occur during heavy rainfall [32].Discharge is affected by drainage network density, slope, channel roughness, and soil infiltration characteristics [33], and peak flow rates are typically affected by within-storm rainfall characteristics.Many studies have indicated that storm runoff from steep well-vegetated headwater catchments in humid areas is produced by saturation-excess mechanisms [34,35].Central to the saturation-excess mechanism is the concept of runoff contributing zones, which expand and contract seasonally and during storms depending on antecedent wetness and storm magnitude [36].The initial soil water content is more important in catchments with good vegetation.The runoff response is more uniform and is not dependent on initial soil moisture when the infiltration excess overland flow is predominant because of high rain depth or less permeable soils [30].Runoff from lower-intensity storms in soils of higher permeability is controlled by the soil water content of the surface soil layers and is more dependent on the initial conditions [30].We attributed the different control factors of the two Rainfall Regimes to rainfall characteristics and consequent runoff mechanisms.

Control Factors of Sediment Yield
The sediment response of catchments is controlled by a complex function of ecological, climatic, and geomorphic processes [37].All of the rainfall-related variables have important effects on TL for Rainfall Regime I. Raindrop splash detachment and surface overland flow are the two basic drivers of soil peel-back, flood generation, water loss, and sediment mobilization [38].However, these two drivers are both rooted in rainfall energy; vegetation reduces this energy, and thus, the Qmax observed at the outlet of the watershed may reflect the rainfall energy.The sediment yield of a catchment represents only a part of the total erosion or sediment production within the catchment, as a considerable portion of the sediment is often deposited before reaching the outlet [39].In the Wangjiaqiao watershed, 76% of the area has slope gradients exceeding of 30%.Cultivated sloping lands are major contributors to sediment yield, and a considerable portion of the sediment may move by gravity rather than by shear forces alone [15].Lu and Higgit [14] conclude that 60% of sediment is from arable land in 32 catchments in the TGA.The soil parent material of the watershed is predominantly purple sandy shale, and bedrock is typically exposed in the channel bed; thus, channel erosion is rare [18].Sediments stored in the channel and distributed within tributaries are transported after flood events with sufficient transport capacity.The PLSR model of TL for Rainfall Regime II illustrates that events with low rainfall depth and short duration typically cause very limited hydrological responses and limited sediment transport, in agreement with previous study [40].
Land use change is important for sediment yield [41]. Soil erosion is largely determined by the absence of protective land cover, whereas sediment export to rivers is determined by onsite sediment production and the connectivity of sediment sources and the river [42,43].Few studies have focused on how changes in land use can influence inter-event hydrologic process and sediment yield at a small watershed scale.Our results indicate that the percentage of forest and farmland in the study area has an important influence on sediment caused by Rainfall Regime I, i.e., heavy rainfall events.The rainfall energy of an event is typically characterized by Im and I30, which had important effects on TL in Rainfall Regime I. Thus, heavy rainfall events, which cause a large sediment yield, are influenced by more factors than lower-intensity rainfall events.A similar conclusion has been reported by Ni et al. [44], who demonstrates that soil erosion has many forms for heavy rainfall events, including sheet erosion, rill erosion, gully erosion, landslip, and even collapses.Thus, the sediment processes are complex, heterogeneous, and controlled by multiple factors.

Conclusions
In this paper, K-means clustering was used to classify 29 rainfall events into two rainfall regimes.Four separate PLSR models were constructed to identify the main variables that control runoff and sediment yield in a small agriculture watershed in the TGA.The results confirmed the complex and heterogeneous nature of the hydrology and sediment response in the watershed.

Figure 1 .
Figure 1.Location of the study watershed in the TGA of China.

Figure 2 .
Figure 2. Land use change in the study watershed for 1995, 2000, and 2005.

Figure 4 .
Figure 4. VIP values of each predictor for PLSR of runoff.

Figure 5 .
Figure 5. VIP values of each predictor for PLSR of sediment load.

Table 1 .
Changes in different land use categories as a percentage of the total watershed area.

Table 2 .
Flood variables and associated abbreviations used in the statistical analysis of the relationship between rainfall, runoff, and suspended sediment transport.

Table 4 .
Statistical features of the different rainfall regimes.

Table 5 .
Summaries of the partial least squares regression (PLSR) models.
aThe RMSECV (cross-validated root mean squared error), Q 2 cum (cross-validated goodness of prediction) per component, R 2 (goodness of fit), and Q 2 (cross-validated goodness of prediction) were calculated for the PLSR models.

Table 6 .
PLSR for runoff a .Values larger than 0.3, which are shown in bold face, indicate that the PLSR components are mainly loaded on the corresponding variables.W* means component.b RCs means Regression Coefficients.