Principal Component and Multiple Regression Analyses for the Estimation of Suspended Sediment Yield in Ungauged Basins of Northern Thailand

Predicting sediment yield is necessary for sound land and water management in any river basin. However, sediment data are often unavailable or sparse, which makes estimating sediment yield a daunting task. The present study investigates the factors influencing suspended sediment yield using principal component analysis (PCA). Regression relationships for estimating suspended sediment yield, based on the key factors selected from the PCA, are then developed. The PCA yields six components of key factors that together explain 86.7% of the variation of all variables. The regression models show that basin size, channel network characteristics, land use, basin steepness and rainfall distribution are the key factors affecting sediment yield. Validation of the regression relationships shows estimation errors ranging from −55% to +315% for suspended sediment yield and from −59% to +259% for area-specific suspended sediment yield. The proposed relationships may be useful for predicting suspended sediment yield in ungauged basins of Northern Thailand that have geologic, climatic and hydrologic conditions similar to those of the study area.


Introduction
An estimation of suspended sediment yield is required for engineering practices that deal with improved land and water management in a river basin. The transport of sediment in rivers implies a series of negative effects, such as reservoir siltation and channel bed modification. Such effects may disturb the sediment balance in the basin. In particular, sediment that is eroded from sloping areas can accumulate in the river network, thereby affecting channel water conveyance [1]. Moreover, several problems due to soil erosion are evident, such as the loss of fine and nutrient-rich topsoil, which reduces land productivity, and the pollution of surface water bodies [2–5]. The study of erosion and sediment yield has long been established as an important area of hydrological research due to the economic significance of the processes involved.
As in other developing Southeast Asian countries, land degradation is a major problem in Thailand. This problem manifests itself in the deterioration of soil structure and fertility, in particular on sloping land [6]. Cultivation on sloping areas influences the environment in terms of siltation, flash floods, poor crop yields, etc. [7]. The estimation of sediment yield is required in planning and designing water resource development projects, especially for studying the feasibility of a dam or a barrage, assessing sediment budgets and examining the delivery of sediment and contaminants to the estuarine or ocean system; it also provides a valuable means of studying the denudation process [8]. However, sediment data are rarely available due to the lack of monitoring. Erosion and sediment transport are complex phenomena, and these processes are affected by several factors, such as climatic and geomorphological conditions, land use, etc.
The approaches employed to estimate sediment yield can be divided into four main groups [9], namely: (1) the soil erosion and sediment delivery approaches, wherein estimated soil erosion rates are factored by a sediment delivery ratio, which is often based on basin characteristics; (2) the physically-based and/or distributed basin modeling approaches, wherein the movement of water and soil is estimated in a distributed way throughout the basin; (3) the models relating sediment concentration or load to river flow, wherein measured sediment concentration data are related to river flow characteristics; and (4) empirical models based on broad basin and climate descriptors, wherein sediment yield equations are derived from known basin characteristics. The soil erosion and sediment delivery approaches are usually based on the Universal Soil Loss Equation (USLE) [10] and the concept of the sediment delivery ratio (SDR) [11]. Although many combinations of erosion and sediment delivery modelling are available [12–16], they still require calibration and, thus, cannot be transferred from the study area to other catchments and environments. Moreover, USLE cannot be applied easily to non-agricultural land uses or to areas outside of the range of the original development and application [9].
The physically-based model describes the physical processes involved in the flow and transport of sediment, using the laws of conservation of mass and momentum, together with sediment transport laws, to explain these processes; however, the physically-based model requires extremely onerous input data. When the input data are scarce, the large number of parameters involved may cause significant uncertainty in soil erosion estimates [1]. Furthermore, the simulation of sediment transport at the basin scale is still computationally very expensive. The models relating sediment concentration or load to river flow are most commonly used in practice. These models assume that river flow, rather than sediment supply, is the dominant factor in sediment yield. However, such models also require a large amount of data to give realistic estimates of long-term average annual sediment yield. This approach is based on "what has happened" rather than "what may happen". Understanding sediment supply and transport processes is required to extrapolate their potential consequences to unmonitored future climate and/or land-use scenarios. The empirical model is based on limited knowledge of the processes and relies on the data describing input and output behavior. This method, however, is able to make abstractions and generalizations of the process and often complements the physically-based model [17].
Several authors have shown the effectiveness of statistical relationships, which allow one to estimate river sediment transport from easily available geomorphologic, hydrological and climatic parameters [1,18–33]. Sediment yield is controlled by factors that control erosion and sediment delivery, including local topography, soil properties, climate, vegetation cover, catchment morphology, drainage network characteristics and land use [26,28]. Langbein and Schumm [32] studied the relationship between mean annual precipitation and sediment yield in the United States, while Walling and Webb [31] concluded that no simple relationship exists between climate and sediment yield, because climate's effect on sediment load is very complex. Anderson [33] proposed three major groups of explanatory variables involved in relating sediment yield to watershed variables. These are the hydrologic event variables, the watershed conditions and land use variables, and the inherent watershed variables, such as area, geology and physiography. He also mentioned that the sediment measuring device and its efficiency are important for obtaining accurate sediment measurements. Bray and Xie [29] identified six categories of variables that can be related to the processes associated with the generation and delivery of suspended sediment to the basin outlet in Canada, which are hydroclimatic conditions, basin topographic features, land surface features, soil characteristics, channel network features and human activities. Ciccacci et al. [30] and Grauso et al. [1] investigated the correlation between the sediment yield and some geomorphologic, hydrological and climatic parameters in Italy. They found a significant relationship between average yearly sediment yield per unit watershed area and the drainage density.
Restrepo et al. [23] developed a multiple regression model for estimating the sediment yield in a South American watershed. They reported six catchment variables that predict sediment yield, namely runoff, precipitation, precipitation peakedness, mean elevation, mean water discharge and relief, with the mean annual runoff being the dominant control factor. Syvitski and Milliman [22] provided a description of factors influencing the estimation of sediment loads from rivers, which are drainage area size, basin relief, geologic condition, climate and vegetation cover. They successfully estimated the long-term flux of sediment delivered by rivers to the coastal zone (488 global rivers) by the BQART model, which is influenced by geomorphic and tectonic characteristics, geography, geology and human activities. Recently, Cohen et al. [19] introduced a comprehensive global fluvial sediment predictor named WBMsed (Water Balance Model with sediment), a distributed global-scale riverine sediment flux model. The most important inputs for the model are anthropogenic factors, ice cover, lithology, reservoir sediment trapping, drainage area size, maximum basin relief, daily temperature and daily discharge.
The statistical method for reducing a large number of interrelated variables into a smaller number of dominant variables is called principal component analysis (PCA) and has been used in many areas of scientific research [17,34–40]. Recently, Tayfur et al. [41] investigated sediment load prediction and generalization from the laboratory scale to the field scale using PCA in conjunction with the data-driven methods of artificial neural networks and genetic algorithms. Despite its wide use, PCA has a disadvantage: the interpretability of the second and higher components may be limited. For this reason, Varimax rotation is applied to the PCA solution to enhance the interpretability of the components by maximizing a simple structure. An alternative rotational approach is known as independent component analysis (ICA) [42–44], which finds a linear representation of non-Gaussian data, so that the components are statistically independent. Westra et al. [44] report that the PCA and Varimax rotations provide fairly accurate interpretations for global and local phenomena, respectively, while the interpretability of ICA results appears to be less successful.
The objective of this study is to propose a complementary methodology, based on a data-driven modeling approach, for predicting suspended sediment yield in an ungauged basin (i.e., one where river flow data are unavailable). To this end, PCA with Varimax rotation is used to identify the key factors affecting sediment yield, and multiple regression analysis is used to establish the relationships between suspended sediment yield and the basin's characteristics in terms of geomorphology and climate.

Study Area
The study basin covers an area of 102,636 km² of the Ping, Wang, Yom and Nan river basins in Northern Thailand. It is located between 15°30′ N and 20°00′ N latitudes and 98°00′ E and 101°30′ E longitudes (Figure 1). The Ping, Wang, Yom and Nan rivers are the main tributaries of the Chao Phraya River, the most important river of Thailand. These four tributaries originate from the Phi Pannam Mountain and course through mountainous areas before merging with each other in the alluvial plains of the Nakhon Sawan Province to form the Chao Phraya River. The study area is mountainous, with agriculturally productive valleys. The Ping, Wang, Yom and Nan rivers flow from north to south. The climate of the study area is dominated by seasonal monsoons. The rainy season, which lasts from May to October, is influenced by the southwest monsoon from the Indian Ocean and the depressions originating in the Pacific Ocean. The average monthly temperature ranges from 15 °C in December to 40 °C in April, except in high altitude locations. The study area can be classified as a tropical rainforest with high biodiversity. The general description of the study area [45] is presented in Table 1.
In terms of soil erosion, Alford's report [46] on mountain watersheds indicates that the Chao Phraya river basin, in Northern Thailand, showed no evidence of a significant increase in sediment yield during the period extending from the late 1950s to the mid-1980s. However, the Northern region of Thailand is very vulnerable to soil erosion, due to its undulating topography, steep slopes and high rainfall. Due to rapid economic development and population growth in the area, the forest-covered land in this northern region decreased from 68.54% in 1961 to 54.27% in 2004 [47]. The most vulnerable area is steeply sloping land that is under cultivation (more than 35% of sloping land). In recent times, human encroachment on forest areas in the upper part of the study area and land use changes with respect to agriculture have become problematic [48].

Framework of the Analysis
The overall study framework involves basin data collection, principal component analysis (PCA) and multiple regression analysis. The data used in the analysis were obtained from hydro-meteorological stations, topographic maps, soil maps and land use maps of the study area. PCA is employed to determine the most prominent variables, which are then used in multiple regression analysis. The details of the data compiled and the methodology employed are presented in the following sections.

Geomorphic Parameters
The topography of the study area was acquired as a 30-m digital elevation model (DEM) from the Geo-Informatics and Space Technology Development Agency (GISTDA). The 30-m DEM was aggregated to 150-m resolution. This aggregation was done for the further use of the DEM in the physically distributed watershed model, the Distributed Hydrology Soil Vegetation Model (DHSVM) [49], in the next phase of this research, which will be published in the near future. The characteristics of each of the sub-basins within the study area were then derived using HEC-GeoHMS 10.1 [50]. These characteristics include basin area, basin perimeter, basin length, basin slope, main channel length, distance between the basin outlet and a point on the stream nearest to the centroid of the basin area, total channel length, drainage density, basin relief, relief ratio, basin elongation and basin circularity. Most of the extracted basin areas and river networks match well with the existing GIS maps published by the Department of Water Resources (DWR) [51]. It is worth noting that a few sub-basins could not be delineated in relatively flat areas, where the river and basin boundaries are difficult to delineate.
The river network's properties, namely the hierarchical anomaly index and the hierarchical anomaly density [1,21,30,52,53], were estimated based on the digitized river network derived from 1:50,000 topographical maps obtained from the Royal Thai Survey Department, which compared satisfactorily with the existing river network [51]. To elaborate on the river network's properties, let G be the number of first order streams necessary to make a drainage network perfectly ordered in a binary tree-shaped structure, with streams of order K flowing into streams of order K + 1, and let N be the number of first order streams present in the drainage network. The hierarchical anomaly index (DA) is given by the ratio of G to N, while the hierarchical anomaly density (GA) is the ratio of G to the basin area in square kilometers. These two parameters express the degree of organization of drainage networks. Ciccacci et al. [30] and Grauso et al. [21] provide more details of these two parameters.
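The two network indices reduce to simple ratios once G, N and the basin area are known. The following minimal sketch illustrates them, together with the standard definition of drainage density (total channel length divided by basin area). The function names and the sample values are hypothetical and are not taken from the study data; deriving G itself from a digitized network is not shown.

```python
def drainage_density(total_channel_length_km, basin_area_km2):
    """Drainage density (km per km^2): total channel length / basin area."""
    return total_channel_length_km / basin_area_km2


def hierarchical_anomaly_index(g, n_first_order):
    """DA: ratio of G to N, following the definition given in the text."""
    return g / n_first_order


def hierarchical_anomaly_density(g, basin_area_km2):
    """GA: ratio of G to the basin area in square kilometers."""
    return g / basin_area_km2


# Hypothetical sub-basin (illustrative values only, not from Table 3).
g = 45            # first order streams needed for a perfectly ordered network
n = 120           # first order streams present in the network
area = 300.0      # basin area, km^2
tcl = 450.0       # total channel length, km

print("DA =", hierarchical_anomaly_index(g, n))
print("GA =", hierarchical_anomaly_density(g, area), "per km^2")
print("Drainage density =", drainage_density(tcl, area), "km per km^2")
```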

Soil Properties
The soil map of the study area was drawn using the Soil Program software [54], which derives 5-min resolution (about 10 km) soil data from the World Inventory of Soil Emission Potentials (WISE) pedon database [55], developed by the International Soil Reference and Information Centre (ISRIC), and the FAO-UNESCO Digital Soil Map of the World [56]. In this study, the soil clay content as a percentage was extracted and used as the soil's representative property.

Land Use
The digital land use map, obtained from the Land Development Department (LDD) of the Royal Thai Government, was employed in this study. It was derived from Landsat 5 satellite imagery with 30-m resolution and the ground truth survey of 2000-2003. The forest and agricultural areas were extracted from each of the sub-basins to represent the land cover property used in the analysis.

Hydro-Meteorological and Sediment Data
The daily suspended sediment yield data, observed at 37 gauging stations operated by the Royal Irrigation Department (RID) and the Department of Water Resources (DWR) of the Royal Thai Government, were obtained as presented in Table 2. Daily suspended sediment data were calculated using the sediment-discharge rating curve technique. The rating curve is derived using at least 20 measurement points per year for each station. The United States standard sampling methods and equipment, e.g., depth integrating samplers (US DH-48, US DH-49 and US DH-59) or point integrating samplers (US-P-46, US-P-61, US-P-63 and US-P-50), are employed based on the water depth and the accessibility of each measurement point [57]. The examples of sediment-discharge rating curve equations provided by RID for the year 2000 for Ping, Wang, Yom and Nan river basins are:
The annual suspended sediment yield was calculated using the daily data for each of the selected sub-basins. Table 2 shows the general description of the suspended sediment gauging stations in this study. For climate characteristics, annual rainfall, wet season rainfall (May-October), dry season rainfall (November-April) and the precipitation concentration index [58] were estimated. The mean areal rainfall was estimated by the Thiessen polygon method using the ArcView ArealRain Extension [59]. In each of the sub-basins, the time series data (suspended sediment yield, annual rainfall, etc.) were averaged as long-term average data for further analysis. The glossary and summary statistics of the variables used in this study are given in Table 3.
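The rainfall descriptors can be illustrated with a short sketch. The text cites the precipitation concentration index [58] without stating its formula, so the code below assumes the commonly used annual form, PCI = 100 · Σp_i² / (Σp_i)², where p_i is the rainfall of month i. The function names and the monthly values are hypothetical.

```python
def precipitation_concentration_index(monthly_mm):
    """Annual PCI, assuming the common form 100 * sum(p_i^2) / (sum(p_i))^2."""
    if len(monthly_mm) != 12:
        raise ValueError("expected 12 monthly rainfall totals")
    total = sum(monthly_mm)
    if total <= 0:
        raise ValueError("annual rainfall must be positive")
    return 100.0 * sum(p * p for p in monthly_mm) / total ** 2


def seasonal_totals(monthly_mm):
    """Return (wet, dry) season rainfall for months ordered January..December.

    Wet season: May-October; dry season: November-April, as in the text.
    """
    wet = sum(monthly_mm[4:10])      # indices 4..9 are May..October
    dry = sum(monthly_mm) - wet
    return wet, dry


# Hypothetical monsoon-type year (mm per month, January..December).
rain = [5, 10, 20, 60, 160, 140, 170, 220, 210, 110, 30, 8]
wet, dry = seasonal_totals(rain)
print("Annual rainfall:", sum(rain), "mm")
print("Wet season:", wet, "mm; dry season:", dry, "mm")
print("PCI:", round(precipitation_concentration_index(rain), 2))
```

Under this form, perfectly uniform monthly rainfall gives the minimum value of about 8.3, and rainfall concentrated in a single month gives 100.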

Principal Component Analysis
Principal component analysis (PCA) was applied in this study to identify the factors influencing suspended sediment yield. The PCA is a method of data reduction that aims to identify a small number of derived variables from a larger number of original variables in order to simplify the subsequent analysis of the data [60,61]. Moreover, the PCA has been used in the present study as the preliminary step in the development of a prediction model [62]. The sequence of the main steps involved in the PCA, as applied by Halim et al. [38], was adapted and is described below:
(1) Selection of a set of basin characteristics and meteorological indicators for the study area. The initial set consisted of 17 basin characteristics and 4 climate factors (Table 3).
(2) Assessment of the suitability of data for the PCA using the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy [63] and Bartlett's test of sphericity [64]. KMO tests the ratio of item correlations to partial item correlations. If the partials are similar to the raw correlations, it means that the item does not share much variance with other items. This is a necessary criterion, since PCA assumes that common factors are the source of variance for the variables under investigation. The range of KMO is from 0.0 to 1.0; a score of 0.50 is suggested as the minimum value for a good PCA [65]. Bartlett's test of sphericity tests the hypothesis that the correlation matrix is an identity matrix, which would mean that all of the variables are uncorrelated. A significant result leads to the rejection of this null hypothesis and to the conclusion that there are correlations in the data set that are appropriate for the PCA. A Bartlett's test result that is significant at the 95% level (p < 0.05) is considered appropriate for the PCA [61]. In addition, Tabachnick and Fidell [66] recommend that, for the PCA, the correlation matrix should show at least some correlations with a correlation coefficient greater than or equal to 0.30.
(3) Determination of dominant factors. The PCA with Varimax rotation is performed to identify the principal components (PCs) or subsets from a larger data set. For selecting the dominant factors, Kaiser's criterion, or the eigenvalue rule, was employed, i.e., only components with eigenvalues of 1.0 or more are retained for further investigation [38,67,68].
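The suitability checks, the eigenvalue rule and the rotation described in the steps above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the SPSS procedure used in the study: the data are synthetic, the function names are hypothetical, Bartlett's test is reported only as a chi-square statistic with its degrees of freedom (no p-value), and the Varimax routine is the widely used SVD-based iteration without Kaiser normalization.

```python
import numpy as np


def kmo(corr):
    """Overall KMO: sum r^2 / (sum r^2 + sum partial^2), off-diagonal only."""
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale                      # partial correlations
    off = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)


def bartlett_sphericity(corr, n_samples):
    """Chi-square statistic and degrees of freedom for H0: corr is identity."""
    p = corr.shape[0]
    _, logdet = np.linalg.slogdet(corr)
    stat = -(n_samples - 1 - (2 * p + 5) / 6.0) * logdet
    return stat, p * (p - 1) // 2


def pca_kaiser(corr):
    """Eigen-decompose corr; keep components with eigenvalue >= 1.0."""
    eigval, eigvec = np.linalg.eigh(corr)
    order = np.argsort(eigval)[::-1]            # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = eigval >= 1.0
    loadings = eigvec[:, keep] * np.sqrt(eigval[keep])
    explained = eigval[keep].sum() / corr.shape[0]
    return eigval, loadings, explained


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal Varimax rotation of a loading matrix (SVD-based iteration)."""
    p, k = loadings.shape
    rot = np.eye(k)
    last = 0.0
    for _ in range(max_iter):
        lam = loadings @ rot
        grad = lam ** 3 - lam @ np.diag(np.sum(lam ** 2, axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ grad)
        rot = u @ vt
        if s.sum() < last * (1 + tol):          # criterion stopped improving
            break
        last = s.sum()
    return loadings @ rot


# Synthetic example: 37 samples, 6 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(37, 2))
weights = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], float)
data = latent @ weights.T + 0.3 * rng.normal(size=(37, 6))
corr = np.corrcoef(data, rowvar=False)

kmo_value = kmo(corr)
stat, dof = bartlett_sphericity(corr, n_samples=37)
eigval, loadings, explained = pca_kaiser(corr)
rotated = varimax(loadings)
print("KMO:", round(kmo_value, 2), "| Bartlett chi-square:", round(stat, 1), "dof:", dof)
print("Retained components:", loadings.shape[1], "| variance explained:", round(explained, 3))
```

Because the rotation is orthogonal, it redistributes the loadings among the retained components but leaves each variable's communality (the row sum of squared loadings) unchanged.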

Regression Analysis
The regression relationships between suspended sediment yield and the dominant factors obtained from the PCA, i.e., biophysical and climate factors, were established using Equation (1). In order to avoid a negative lower boundary of estimation, a logarithmic transformation was used. The regression coefficients were obtained by ordinary least squares linear regression on the logarithms of the response and predictor variables. Finally, a back-transformed relationship was obtained in the form [62]:
$$Y = \beta_0 X_1^{\beta_1} X_2^{\beta_2} \cdots X_p^{\beta_p} \qquad (1)$$
where $Y$ is the response variable (suspended sediment yield in this study), $X_1, X_2, \ldots, X_p$ are the predictor variables (the factors influencing suspended sediment yield) and $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are constants derived by the multiple linear regression analysis. Stepwise linear regression analysis, the most commonly used procedure for selecting the best regression equation, was performed (using an F probability of 0.05 for the selected factor), as described by Landau and Everitt [60], using SPSS for Windows Release 11.5.
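The log-transform, fit and back-transform sequence can be sketched as follows. The code regresses ln Y on ln X by ordinary least squares and returns the coefficients of the multiplicative form Y = β0 · X1^β1 · … · Xp^βp. It is a minimal illustration under stated assumptions: the function names and the synthetic data are hypothetical, no stepwise variable selection is performed, and no correction for back-transformation bias is applied.

```python
import numpy as np


def fit_power_law(X, y):
    """OLS on logarithms; returns (beta0, exponents) of the power-law form."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.any(X <= 0) or np.any(y <= 0):
        raise ValueError("log transform requires strictly positive values")
    design = np.column_stack([np.ones(len(y)), np.log(X)])
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return float(np.exp(coef[0])), coef[1:]      # back-transform the intercept


def predict_power_law(beta0, betas, X):
    """Evaluate beta0 * prod_j X_j ** beta_j for each row of X."""
    return beta0 * np.prod(np.asarray(X, dtype=float) ** betas, axis=1)


# Synthetic, noise-free example: Y = 2 * X1^1.5 * X2^-0.5 (hypothetical).
rng = np.random.default_rng(42)
X = rng.uniform(1.0, 100.0, size=(30, 2))
y = 2.0 * X[:, 0] ** 1.5 * X[:, 1] ** -0.5

beta0, betas = fit_power_law(X, y)
print("beta0:", round(beta0, 4), "exponents:", np.round(betas, 4))
```

Fitting in log space guarantees positive predictions after back-transformation, which is the stated reason for using the logarithmic transformation.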
Generally, the size of the drainage area is an important factor for both suspended sediment yield and area-specific suspended sediment yield. The relationship between the size of the drainage area and suspended sediment yield is complicated by many other factors, such as rainfall, plant cover, texture of the sediment and land use [69]. In order to evaluate the effect of each dominant factor in predicting suspended sediment yield in various categories of basin sizes [70] and based on the available data, regression models were generated based on 4 groups of data: (1) 7 sub-basins with a drainage area of less than 100 km² (small basins); (2) 15 sub-basins with a drainage area of more than 100 km², but less than 1000 km² (medium basins); (3) 8 sub-basins with a drainage area of more than 1000 km² (large basins); and (4) all 37 sub-basins irrespective of drainage area size.
Notes: The overall data is 37 samples from 37 sub-basins; * lower limit; ** upper limit.
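The grouping by drainage area can be expressed as a small helper. The sketch below uses the 100 km² and 1000 km² thresholds from the text; the function name is hypothetical, and the handling of areas exactly equal to a threshold (assigned here to the larger class) is an assumption, since the text does not specify it. The sample areas are those of the validation stations listed in this paper.

```python
def basin_size_class(area_km2):
    """Classify a sub-basin as small, medium or large by drainage area (km^2)."""
    if area_km2 <= 0:
        raise ValueError("drainage area must be positive")
    if area_km2 < 100.0:
        return "small"
    if area_km2 < 1000.0:        # boundary handling is an assumption
        return "medium"
    return "large"


# Drainage areas (km^2) of the seven validation stations named in the text.
stations = {
    "060201": 47.54, "060602": 163.35, "N58": 296.53, "P24A": 449.33,
    "Y26": 787.01, "N24": 1816.70, "N40": 4180.45,
}
classes = {name: basin_size_class(area) for name, area in stations.items()}
for name, label in classes.items():
    print(name, stations[name], "km^2 ->", label)
```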

Model Validation
From the 37 samples (37 selected stations), 30 samples were used for the multiple regression model's development, while the remaining 7 samples were randomly excluded, based on the drainage area size, for the validation of the model. Additionally, the jack-knife technique [62] was applied to examine the validity of the developed regression models. This technique is generally performed by excluding one sub-basin from the total set of sub-basins. A regression having the same form as that of the general model is then fitted using the all-but-one sub-basins, and the suspended sediment yield (SSY) or area-specific suspended sediment yield (ASSY) of the left-out sub-basin is estimated by the resulting regression model, called the test model. The calculation procedure was repeated for all sub-basins, and the coefficient of determination of the test model was calculated. Furthermore, the Pearson product-moment correlation coefficient between the predicted SSY or ASSY from the general and test models was also calculated.
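The jack-knife procedure can be sketched as a leave-one-out loop around a log-log regression. The code below is an illustration under stated assumptions rather than the study's implementation: the data are synthetic, the function names are hypothetical, and the coefficient of determination is computed in log space between observed values and test-model predictions, which is one possible reading of the procedure.

```python
import numpy as np


def fit_loglog(X, y):
    """OLS of ln(y) on ln(X) with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(y)), np.log(X)])
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return coef


def predict_loglog(coef, X):
    """Back-transformed prediction from log-space coefficients."""
    return np.exp(coef[0] + np.log(X) @ coef[1:])


def jackknife_predictions(X, y):
    """Predict each sample from a model fitted to all the other samples."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # leave sub-basin i out
        coef = fit_loglog(X[keep], y[keep])
        preds[i] = predict_loglog(coef, X[i:i + 1])[0]
    return preds


def r_squared_log(observed, predicted):
    """Coefficient of determination computed on logarithms (an assumption)."""
    res = np.log(observed) - np.log(predicted)
    tot = np.log(observed) - np.log(observed).mean()
    return 1.0 - np.sum(res ** 2) / np.sum(tot ** 2)


# Synthetic data: 30 sub-basins, yield driven by area and forest cover.
rng = np.random.default_rng(1)
n = 30
area = np.exp(rng.uniform(np.log(10.0), np.log(5000.0), n))
forest = rng.uniform(20.0, 90.0, n)
X = np.column_stack([area, forest])
y = 5.0 * area ** 0.9 * forest ** -0.5 * np.exp(rng.normal(0.0, 0.2, n))

general = predict_loglog(fit_loglog(X, y), X)    # general model, all samples
test = jackknife_predictions(X, y)               # test models, leave-one-out
r = np.corrcoef(general, test)[0, 1]             # Pearson correlation
print("R2 general:", round(r_squared_log(y, general), 3))
print("R2 test:", round(r_squared_log(y, test), 3))
print("Pearson r (general vs. test):", round(r, 3))
```

The test-model R² can never exceed the general-model R² for an ordinary least squares fit, so a large gap between the two indicates that individual sub-basins have a strong influence on the fitted relationship.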

Factors Influencing Suspended Sediment Yield
In this study, prior to performing PCA, the suitability of the data for analysis was assessed. There were 37 datasets of 23 variables, consisting of 17 basin characteristics, four climate factors and two sediment-related variables (Table 3). The cross-correlations among the 23 variables are given in Table 4. The KMO score was 0.59, and Bartlett's test of sphericity showed significance at the 95% level, which reasonably supports the factorability of the correlation matrix. Additionally, the correlation matrix showed many correlation coefficients above 0.30. Therefore, factor analysis could be applied to reduce the number of factors in this study.
The PCA results based on the correlation matrix analysis with Varimax rotation indicate six principal components with eigenvalues greater than 1.00, which correspond to an overall cumulative variance of 86.7%. The order of significance of these components is determined by the magnitude of their eigenvalues, as presented in Table 5.
The different variables considered in the PCA and their factor loadings within their respective PCs are presented in Table 6. It shows that the highly weighted variables (factor loading ≥ 0.60) for PC1 consist of the total channel length (TCL), basin area (AREA), main channel length (MCL), distance from the basin outlet to a point on the stream nearest to the centroid of the basin area (LC), basin perimeter (BP), suspended sediment yield (SSY), basin length (BL), hierarchical anomaly index (DA) and hierarchical anomaly density (GA). PC2 consists of wet season rainfall (WSR), annual rainfall (AR) and hierarchical anomaly density (GA). PC3 consists of basin slope (BS), relief ratio (RR) and basin circularity (BC). PC4 consists of agricultural area (AA), forest area (FA) and area-specific suspended sediment yield (ASSY). PC5 consists of the precipitation concentration index (PCI) and dry season rainfall (DSR). Lastly, PC6 has basin elongation (BE) as a variable with high loading. From PC1 and PC4, it was seen that the suspended sediment yield (SSY) corresponds to the basin size (AREA), whereas area-specific suspended sediment yield (ASSY) corresponds to such land cover characteristics as forest area (FA) and agricultural area (AA), which implies that forest cover can reduce the erosion rate.
In addition, from Table 6, basin relief (BR) and top soil clay content (TSCC) show lower communality than the other variables, with scores of 0.319 and 0.645, respectively. These numbers suggest that a substantial portion of the variance of these two variables is not accounted for by the retained components, and they are therefore considered to be less closely related to the other variables. To select prominent variables for the subsequent regression analyses, the first three variables with the highest factor loadings, each greater than 0.60, were selected as representative variables of each of the PCs. A threshold of 0.60 was used for identifying a reliable factor in this study [71]. Therefore, for PC1, the total channel length (TCL), basin area (AREA) and main channel length (MCL) were selected. For PC2, wet season rainfall (WSR), annual rainfall (AR) and hierarchical anomaly density (GA) were employed. For PC3, basin slope (BS), relief ratio (RR) and basin circularity (BC) were used. Agricultural area (AA) and forest area (FA) were extracted from PC4, whereas area-specific suspended sediment yield (ASSY) was considered as the response variable in the regression analysis. For PC5, the precipitation concentration index (PCI) and dry season rainfall (DSR) were chosen. Finally, only basin elongation (BE) was considered from PC6. All 14 factors were assumed to be the forcing factors of suspended sediment yield, with positive or negative effects, and can be used subsequently as predictor variables in regression analysis.
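The selection rule described above (take up to three variables per component with the highest loadings above 0.60, then set aside the response variables) can be sketched as follows. The loading values are hypothetical placeholders, since Table 6 is not reproduced here, and ranking by absolute loading is an assumption made so that strongly negative loadings are also considered.

```python
def select_representatives(loadings, responses=(), threshold=0.60, top_n=3):
    """Pick up to top_n variables per component with |loading| >= threshold.

    loadings: dict mapping component name -> {variable: loading}.
    Response variables are removed after the top_n ranking is made.
    """
    selected = {}
    for component, var_loads in loadings.items():
        strong = [(v, l) for v, l in var_loads.items() if abs(l) >= threshold]
        strong.sort(key=lambda item: abs(item[1]), reverse=True)
        top = [v for v, _ in strong[:top_n]]
        selected[component] = [v for v in top if v not in responses]
    return selected


# Hypothetical loadings for two components (illustrative values only).
example = {
    "PC1": {"TCL": 0.97, "AREA": 0.96, "MCL": 0.94, "LC": 0.93, "SSY": 0.85},
    "PC4": {"AA": 0.90, "FA": -0.88, "ASSY": 0.70, "TSCC": 0.30},
}
chosen = select_representatives(example, responses=("SSY", "ASSY"))
print(chosen)
```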

Regression Relationships to Estimate Suspended Sediment Yield
From the 37 samples (37 selected stations), 30 samples were used for the multiple regression model's development, while the remaining seven samples were randomly excluded, based on the drainage area size, for the validation of the model. The excluded stations comprise 060201 (47.54 km²), 060602 (163.35 km²), N58 (296.53 km²), P24A (449.33 km²), Y26 (787.01 km²), N24 (1816.70 km²) and N40 (4180.45 km²) (shown in Table 2). The other 30 samples were used in the multiple regression analysis, which was performed using the selected 14 factors (TCL, AREA, MCL, WSR, AR, GA, BS, RR, BC, AA, FA, PCI, DSR and BE) as the predictor variables. SSY and ASSY were taken as response variables. The analysis was done using the stepwise regression technique [60] in each of the groups defined by drainage area. To guard against multi-collinearity in the analysis [61], the resulting regression equations were inspected to ensure that there were no inter-correlations among the predictor variables. The sample adequacy criterion suggested by Haan [62] was also considered; he suggested that the sample number should be at least three or four times the number of predictor variables. The results of the multiple regression analysis are presented in Table 7. Based on the stepwise regression analysis, ANOVA shows the significance to be less than 0.001 as per the F-test for all of the equations, while in some cases no relationship was obtained because of statistical insignificance. Based on the coefficient of determination, R², and the standard error of estimation, it can be concluded that ASSY develops better relationships with the selected dominant factors than SSY for drainage areas of less than 1000 km². In contrast, the prediction of SSY is more reliable than the prediction of ASSY for basins with larger drainage areas (more than 1000 km²). This implies that a larger basin contributes to complexity and uncertainty in ASSY modeling.
Regarding the effect of the predictor variables on the amount of suspended sediment, it was found that GA, AREA and FA contribute to SSY estimation, while BS, GA, MCL, FA and DSR contribute to ASSY estimation. Therefore, it was concluded that basin size, channel network characteristics, land use, basin steepness and rainfall distribution are the key factors affecting the amount of suspended sediment. For the medium-sized basins (100 to 1000 km²), the regression relationships imply that less forest cover or more agricultural area contributes to more SSY and ASSY. The larger the basin size, the greater the SSY. It was also found that basin slope (BS) is the factor affecting ASSY for drainage areas of less than 100 km²: a higher basin slope contributes to higher ASSY only for basins smaller than 100 km². Additionally, high dry season rainfall leads to a high amount of ASSY. This physically implies that the land use/cover (e.g., crops) during the dry season enhances soil surface erosion relative to the wet season. Considering all data samples (irrespective of basin size), the relationships shown in Equations (7) and (8) revealed the interesting result that the amount of suspended sediment depends on the basin size and dry season rainfall, irrespective of the geomorphological conditions. Kazama et al. [72] also pointed out that, in the Mekong Basin, suspended sediment transport is more sensitive to particle size than to the channel bed slope. This may mean that in these regions (the study area and the Mekong region, which are adjacent to each other) topography has a low influence on sediment yield.

Regression Model Validation
The summary results of the jack-knife technique are presented in Table 8. Figures 2 and 3 elaborate the validation results of Equation (7). These results indicate that all general models have a high correlation coefficient (R) between the general and test models. However, some test models, represented by Equations (2), (4) and (6), give a relatively low value of R² compared to Equations (3), (5) and (7). This also supports the result of the regression analysis in the previous section that ASSY provides better correlations to the factors than SSY when the basin area is smaller than 1000 km². The results show that Equation (8) is less reliable, with very small values of R² for both the general and test models, which might result in relatively higher error values. Thus, the use of Equation (8) is not recommended. The seven stations that were left out initially were used for model testing. The validation results for SSY and ASSY are given in Tables 9 and 10, respectively, and are shown graphically in Figures 4 and 5, respectively. The validation results indicate that, in most cases, using the model for a particular group based on the drainage area size provides more accurate values than using a model developed from all data sets. The error of estimation ranges from −55% to +315% for SSY prediction and from −59% to +259% for ASSY prediction (Equations (7) and (8) are excluded). However, if Equations (7) and (8) are employed, the estimated error of SSY and ASSY will range from −76% to +514% (last column in Table 9) and from −76% to +622% (last column in Table 10), respectively. Figures 4 and 5 also show that Equations (7) and (8) give relatively higher errors of estimation compared to the models developed for the three classes of basin area.
[Figure: predicted SSY (ton yr⁻¹) versus observed SSY (ton yr⁻¹) on logarithmic axes; series: Equations (2), (4), (6) and (7)]
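The error of estimation quoted above is consistent with a signed relative error expressed as a percentage of the observed value. The sketch below assumes that definition, 100 · (predicted − observed) / observed, which the text does not state explicitly; the function names and sample values are hypothetical.

```python
def percent_error(predicted, observed):
    """Signed error of estimation as a percentage of the observed value."""
    if observed <= 0:
        raise ValueError("observed value must be positive")
    return 100.0 * (predicted - observed) / observed


def error_range(predicted_values, observed_values):
    """Return (minimum, maximum) percent error over paired values."""
    errors = [percent_error(p, o)
              for p, o in zip(predicted_values, observed_values)]
    return min(errors), max(errors)


# Hypothetical predicted and observed yields (ton per year).
observed = [100.0, 100.0, 2000.0]
predicted = [45.0, 415.0, 2400.0]
low, high = error_range(predicted, observed)
print("Error of estimation: %+.0f%% to %+.0f%%" % (low, high))
```

Under this definition, underestimation is bounded at −100%, while overestimation is unbounded, which is why the reported ranges are strongly asymmetric.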

Conclusions
This study presented an investigation of the factors affecting suspended sediment yield in the Ping, Wang, Yom and Nan river basins in Thailand using principal component analysis. From the principal component analysis, six components of dominant factors influencing suspended sediment yield were identified. Together, these components explain 86.7% of the total variance of all variables considered in the analysis. The dominant factors from each group were then taken as predictor variables in the subsequent multiple regression analysis to estimate suspended sediment yield and area-specific suspended sediment yield.
From the regression analysis, it was found that there are three factors that significantly affect suspended sediment yield. These factors are hierarchical anomaly density, basin area and forest area. On the other hand, there are five factors that significantly influence area-specific suspended sediment yield. These are basin slope, hierarchical anomaly density, main channel length, forest area and dry season rainfall. The regression models indicate better predictability of suspended sediment yield and area-specific sediment yield for basins with a drainage area of less than 1000 km².
A set of equations for predicting suspended sediment yield and area-specific suspended sediment yield for basins of different sizes was proposed, together with the associated range of the error of estimation. These equations may be used to estimate the expected sediment yield in ungauged basins in the planning and design of water and land development and conservation projects in the northern part of Thailand, using easily determined dominant input variables. However, it should be noted that the error of estimation for suspended sediment is relatively high, which is partially due to uncertainties in the sediment sampling/measurement (especially during high discharges or flood events) and in developing the sediment-discharge rating curve equations. Additionally, these models were developed for the estimation of average annual suspended sediment in ungauged basins. Therefore, applying the models to sediment yield over short time periods, such as event-based estimation, is not recommended, due to the hysteresis effect in sediment rating curves.
[Figure: axis label observed ASSY (ton km⁻² yr⁻¹); series for Equations (3), (5) and (8).]

Figure 1. The study area showing the locations of suspended sediment gauging stations.

Figure 2. The general (for Equation (7)) and test models' correlation diagram of predicted versus observed SSY.

Figure 3. Scatter plot of the SSY results of the general model and Equation (7) versus the test models.

Table 1. A general description of the study area.
Since there are no existing water infrastructures upstream of the gauging stations, the data are free from the effects of regulating structures. The daily rainfall data, monitored by RID and the Thai Meteorological Department (TMD), were obtained from 125 stations located in the study area. Since land use data were available for 2000-2002, both sediment and rainfall data collected from 1995 to 2007 (based upon availability) were used in the study, assuming that the land use remained the same and that no major man-made changes took place during the study data period.

Table 2. List of suspended sediment gauging stations in this study.
Notes: * Stations No. 1 to 14 belong to the Department of Water Resources (DWR), and those from No. 15 to 37 belong to the Royal Irrigation Department (RID); ** the basin area is extracted from the DEM; *** stations and data in bold were not used in developing the regression model and were used for validating the developed model.

Table 3. Glossary and summary statistics for the sub-basin's characteristics.

Table 4. Correlation matrix of the identified variables.

Table 5. Principal components (PCs) for basin characteristic and climate factors.

Table 6. Results of principal component analysis (Varimax rotated component matrix).

Table 7. Results of the multiple regression analysis.

Table 8. Model validation results using the jack-knife technique.

Table 9. Validation results for SSY prediction.

Table 10. Validation results for ASSY prediction.