Prediction of Biomass Production and Nutrient Uptake in Land Application Using Partial Least Squares Regression Analysis

Partial Least Squares Regression (PLSR) can integrate a great number of variables and overcome collinearity problems, which makes it suitable for intensive agronomical practices such as land application. In the present study a PLSR model was developed to predict important management goals, including biomass production and nutrient recovery (i.e., nitrogen and phosphorus), associated with treatment potential, environmental impacts, and economic benefits. Effluent loading and a considerable number of soil parameters commonly monitored in effluent-irrigated lands were considered as potential predictor variables during model development. All data were derived from a three-year field trial including plantations of four different plant species (Acacia cyanophylla, Eucalyptus camaldulensis, Populus nigra, and Arundo donax), irrigated with pre-treated domestic effluent. The PLSR method was very effective despite the small sample size and the wide nature of the data set (with many highly correlated inputs and several highly correlated responses). Through the PLSR method the number of initial predictor variables was reduced, and only a few variables were retained in the final PLSR model. The important input variables retained were: effluent loading, electrical conductivity (EC), available phosphorus (Olsen-P), Na+, Ca2+, Mg2+, K+, SAR, and NO3−-N. Among these variables, effluent loading, EC, and nitrates made the greatest contribution to the final PLSR model. PLSR is highly compatible with intensive agronomical practices such as land application, in which a large number of highly collinear and noisy input variables is monitored to assess plant species performance and detect potential impacts on soil and the surrounding environment.


Introduction
Plant biomass production and nutrient recovery are important management goals in land application, associated with treatment efficiency, potential impacts on the environment, and economic benefits [1,2]. Plant species with high biomass potential usually achieve increased nutrient recovery resulting from high biomass yield and nutrient assimilation in plant tissues [3]. On the other hand, such plant species may receive high effluent loading, due to their high water requirements, which in turn may result in changes in soil properties and put the surrounding environment at risk as a result of the increased nutrient and/or pollutant release [4–6]. The latter, as well as potential negative impacts on soil properties, are undesirable during land application and may have a negative influence on vegetation and overall system performance. Considering this close relationship between vegetation performance and soil properties in land application, any quantitative description of this relationship in the form of a strong prediction model would be valuable and could provide useful information during system design and monitoring. Until now several statistical methods have been used to develop efficient prediction models in crop and soil science, such as principal component regression (PCR), multiple regression, and partial least squares analysis [7–10].
Partial least squares regression (PLSR) has become a popular statistical technique used widely in chemometrics and other related areas [11–13]. It is also widely applied in soil and crop studies, providing information on either soil or plant parameters from spectroscopic measurements [14–17]. PLSR constructs models by linking predictors (Xs) and responses (Ys), and this is achieved via a projection procedure which reduces data dimensionality to a small number of important factors (also called latent variables). A characteristic of the method, unlike other projection methods (e.g., PCR), is that it integrates the compression and regression steps, while selection of the orthogonal factors is carried out to achieve maximum covariance between the predictor and response variables [11]. Because of its principles, the PLSR approach fits best in cases where the matrix of predictors has more variables than observations and the predictor variables are highly collinear [18].
PLSR models can deal effectively with the following scenarios: (a) wide data (where the number of input variables is much greater than the number of observations); (b) tall data; (c) square data; (d) collinear data; and (e) noisy data. This makes PLSR suitable for studies dealing with land application, where usually a great number of highly collinear and noisy input variables is monitored to describe plant species performance and effects on soil properties and the environment. However, PLSR and even multivariate techniques in general are either lacking or under-utilized in land application schemes. Thus, the primary objective of the present study was the application of a PLSR approach in order to develop a robust model capable of predicting important management goals in land application (i.e., biomass production and N and P uptake) using critical soil parameters as predictor variables. The information provided here is expected to help in the development of an appropriate methodology for the design and monitoring of intensive agronomical practices, such as land application, in the quest for appropriate management strategies with respect to vegetation and field practices.
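The projection step described above can be illustrated with a minimal NumPy sketch of NIPALS-style PLS factor extraction. It is not the implementation used in this study; the function name nipals_pls, the tolerance, and the iteration limit are assumptions chosen for illustration, and X and Y are assumed to be column-centered (and usually scaled) beforehand.

```python
import numpy as np

def nipals_pls(X, Y, n_factors, tol=1e-10, max_iter=500):
    """Illustrative NIPALS PLS sketch for centered X (n x p) and Y (n x m)."""
    X = np.asarray(X, dtype=float).copy()
    Y = np.asarray(Y, dtype=float).copy()
    n, p = X.shape
    m = Y.shape[1]
    W = np.zeros((p, n_factors))  # X weights
    P = np.zeros((p, n_factors))  # X loadings
    Q = np.zeros((m, n_factors))  # Y loadings
    T = np.zeros((n, n_factors))  # X scores (latent variables)
    for a in range(n_factors):
        # Start from the response column with the largest variance.
        u = Y[:, [np.argmax(Y.var(axis=0))]]
        w_old = np.zeros((p, 1))
        for _ in range(max_iter):
            w = X.T @ u                # direction of maximum covariance with u
            w /= np.linalg.norm(w)
            t = X @ w                  # X scores for this factor
            q = Y.T @ t / (t.T @ t)    # Y loadings
            u = Y @ q / (q.T @ q)      # Y scores
            if np.linalg.norm(w - w_old) < tol:
                break
            w_old = w
        p_a = X.T @ t / (t.T @ t)      # X loadings
        X -= t @ p_a.T                 # deflate X
        Y -= t @ q.T                   # deflate Y
        W[:, [a]] = w
        P[:, [a]] = p_a
        Q[:, [a]] = q
        T[:, [a]] = t
    # Regression coefficients for the centered data: B = W (P'W)^-1 Q'
    B = W @ np.linalg.solve(P.T @ W, Q.T)
    return {"W": W, "P": P, "Q": Q, "T": T, "B": B}
```

Because X is deflated after each factor, the score vectors are mutually orthogonal, which is what allows PLSR to handle highly collinear predictors. When as many factors as predictors are extracted from a full-rank X, the coefficients coincide with ordinary least squares.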

LTS Set Up, Sampling, and Chemical Analyses
A three-year field trial with four different plant species (Eucalyptus camaldulensis, Acacia cyanophylla, Populus nigra, and Arundo donax), each forming a separate land treatment system (LTS), was carried out at Skalani village, located approximately 5 km south of Iraklion city, Hellas (at 35°16'50.87"N, 25°10'52.61"E). Plant species received septic tank municipal effluents for three consecutive years (2001-2003) at a hydraulic loading rate based on crop water requirements and evaporation losses. The soil in which the LTS were established was characterized as a clay loam with relatively high calcium content (55% CaCO3). Details about LTS set up, effluent loading and characteristics, soil properties, climatic conditions of the area, and methods used to determine and assess soil data were previously described (soil surface data from 0-7.5 to 55-65 cm obtained from the third irrigation period were included in PLSR as described below) [4]. In brief, soil samples were prepared and analyzed according to the procedures given in Methods of Soil Analysis [19]. pH, EC, and soluble Na+, Ca2+, and Mg2+ were assessed in saturation paste extracts, with atomic absorption spectrometry (Ca2+ and Mg2+) and flame photometry (Na+). Soil organic matter (SOM) was assessed by the Walkley and Black wet-digestion method and available P according to the Olsen method after extraction with NaHCO3. Total Kjeldahl Nitrogen (TKN) was assessed by a macro-Kjeldahl device and analysis of NO3−-N in soil solution samples was carried out using the phenol-disulfonic acid method. In the present work, additional measurements were carried out: soil C:N ratio was determined as the quotient of organic matter and TKN contents; gravimetric moisture content (ω) was determined by oven drying a representative undisturbed ring of moist soil at 105 °C; bulk density (ρb) was determined as the weight of soil dried at 105 °C per unit volume (g/cm3).
Sampling and measurements regarding biomass and nutrient recovery across plant species are also presented in our previous study [3]. In brief, at the end of every growing season (October), one representative tree from each of four plots was harvested and separated into individual organs. The fresh weights of leaves, shoots, and trunk (old wood) were recorded. For reeds, the whole plot surface was harvested each season and separated into leaves and shoots. Dry weights of vegetation were determined by drying (65 °C) to a constant weight. In 2002 and 2003 one replicate plot was harvested from each treatment and the tissue dry weights were determined. The dried samples were ground to 1 mm and used in elemental analysis. Micro-Kjeldahl N-digestion was used to determine the total-N content of biomass samples [20], and P content was determined by the vanado-molybdo-phosphoric acid colorimetric method after digesting the samples with a mixture of perchloric and nitric acids.
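The derived soil quantities mentioned above can be sketched as short helper functions. The function names and example values are hypothetical; the SAR expression is the standard definition with cation concentrations in meq/L, which the text does not state explicitly and is assumed here.

```python
import math

def gravimetric_moisture(moist_mass_g, dry_mass_g):
    """Gravimetric water content (g water per g oven-dry soil, dried at 105 C)."""
    return (moist_mass_g - dry_mass_g) / dry_mass_g

def bulk_density(dry_mass_g, ring_volume_cm3):
    """Bulk density in g/cm3: oven-dry soil mass per unit volume of the sampling ring."""
    return dry_mass_g / ring_volume_cm3

def sodium_adsorption_ratio(na_meq_l, ca_meq_l, mg_meq_l):
    """Standard SAR definition; all concentrations assumed to be in meq/L."""
    return na_meq_l / math.sqrt((ca_meq_l + mg_meq_l) / 2.0)
```

For instance, with hypothetical saturation-extract concentrations of 10, 6, and 2 meq/L for Na+, Ca2+, and Mg2+, sodium_adsorption_ratio(10, 6, 2) evaluates 10 divided by the square root of 4.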

Statistical Analysis
Effluent loading and a set of soil parameters were selected as candidate X variables in the PLSR model according to their importance in system description and ease of determination. The model contained plant biomass and N and P uptake as response variables (Ys). The soil parameters were SOM, dissolved organic matter (as COD), TKN, pH, EC, soil solution NH3-N (in soil solution sampler), soil solution P, soil solution EC, NO3−-N, C:N, Olsen-P, Na+, Ca2+, Mg2+, SAR, K+, ρb, and ω. Most of these soil input variables were measured at each location across five depth intervals over the 0.65 m soil profile, and their average values were used in the initial PLSR model to eliminate some of the noise. After the preliminary PLSR regression the important X variables were identified and included in the final PLSR prediction model.
The non-linear iterative partial least squares (NIPALS) algorithm was used for computing the first few factors. K-fold cross-validation was used to select the number of factors that minimizes the Root Mean PRESS statistic. For each X variable, the VIP (variable importance for the projection) was calculated to assess its importance in the determination of the PLSR projection model for both predictors and responses [21]; the VIP value of a variable measures its influence on the factors that define the model. Wold and others advocated a cut-off value for VIP to separate terms that do not make an important contribution to the dimensionality reduction involved in PLSR (VIP < 0.8) from those that might (VIP ≥ 0.8). In addition, the percentage of variation explained for X variables and Y responses and the contribution of each of the important factors were assessed. Loadings were also calculated and plotted to give another view of the relationships between the Xs, the Ys, and the PLSR factors. Also, X coefficients in the PLSR model were calculated to assess their contribution to the prediction of the Ys. Based on the PLSR model, the predicted values for the responses were calculated and plotted versus the observed values. Validation of the PLSR prediction model was performed based on data derived from the previous year (2002). All analyses were carried out with the PLS platform of JMP® Pro Version 11.2.1 (SAS Institute Inc., Cary, NC, USA) [22].
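The depth-averaging step can be sketched as follows, assuming the measurements are arranged as a three-dimensional array of locations by depth intervals by variables. The array layout and the function name profile_means are assumptions for illustration; missing measurements are marked as NaN and ignored in the mean.

```python
import numpy as np

def profile_means(values):
    """Average each soil variable over the depth intervals of each location.

    values: array of shape (n_locations, n_depths, n_variables);
    NaN marks a missing measurement and is skipped.
    """
    return np.nanmean(np.asarray(values, dtype=float), axis=1)
```

The result has one row per location and one column per soil variable, which is the layout expected for the X matrix of the preliminary model.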
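As a hedged illustration of two of the diagnostics named above, the sketch below computes VIP scores from the weights, scores, and Y loadings of a fitted PLS model, and a K-fold root-mean PRESS-style statistic. The names vip_scores, kfold_root_mean_press, and fit_predict are hypothetical, and the exact scaling of the Root Mean PRESS reported by JMP may differ from the simple normalization assumed here.

```python
import numpy as np

def vip_scores(W, T, Q):
    """VIP per X variable from X weights W (p x A), X scores T (n x A), Y loadings Q (m x A)."""
    W, T, Q = (np.asarray(a, dtype=float) for a in (W, T, Q))
    p = W.shape[0]
    # Sum of squares of Y explained by each factor.
    ss = np.sum(T**2, axis=0) * np.sum(Q**2, axis=0)
    # Squared, column-normalized weights.
    w2 = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(p * (w2 @ ss) / ss.sum())

def kfold_root_mean_press(X, Y, fit_predict, n_factors, k=5, seed=0):
    """K-fold cross-validated root-mean PRESS (assumed scaling: per observation and response).

    fit_predict(X_train, Y_train, X_test, n_factors) is a user-supplied callable
    that fits a model on the training fold and returns predictions for X_test.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    press = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        Y_hat = fit_predict(X[train_idx], Y[train_idx], X[test_idx], n_factors)
        press += np.sum((Y[test_idx] - Y_hat) ** 2)
    return np.sqrt(press / Y.size)
```

In use, kfold_root_mean_press would be evaluated for each candidate number of factors and the smallest value retained. By construction the squared VIP scores average to one, so a cut-off such as 0.8 marks variables that contribute less than a typical predictor.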

Results
Preliminary PLSR regression removed several soil parameters with minor contribution to the prediction model. The remaining parameters were included in the final PLSR model. These parameters were effluent loading, EC, available phosphorus (Olsen-P), Na+, Ca2+, Mg2+, K+, SAR, and NO3−-N. Based on the PLSR method, the above X data set was reduced to two principal factors. The first explained 45.4% of the variation while the second explained 34.2%. Thus, the cumulative variation explained by the two principal factors was 79.6%. The percentage of variation explained for X variables and the contribution of each of the two factors are shown in Figure 1a. Among X variables, K+, SAR, and Na+ were the greatest contributors to explained variation. Also, Mg2+, EC, nitrates, effluent, Ca2+, and P had distinctively higher factor loadings (X loadings) on the first factor compared to K+, Na+, and SAR. With regard to the second factor, SAR, Na+, EC, and effluent received positive loadings that were higher than those of the other variables (Figure 1c). In terms of Y data (Y responses), two factors explained 78.3% of the variation, with the first contributing 70.4% and the second 7.9%. Total biomass had the highest percentage of explained variation, followed by N uptake and P uptake (Figure 1b). In the first and strongest factor, plant biomass had the highest factor loading (Y loadings) and N uptake the lowest. In the second factor, N uptake received the highest factor loading, above plant biomass and P uptake (Figure 1d). The amount of effluent, EC, nitrates, and Mg2+ were among the most powerful X variables in the determination of the PLSR model (Figure 2). Based on the 0.8 threshold, two X variables (SAR and K+) were less influential in the final PLSR model (Figure 2); however, because of their importance for the X factors (Figure 1a) and their contribution to the prediction of the Ys (Figure 3), they were retained within the model. Effluent had the greatest coefficient values across the Y prediction models, followed by EC, Na+, SAR, and/or nitrates depending on the Y response (Figure 3). Specifically, apart from the amount of effluent, EC also had high coefficient values for the prediction of both biomass and nutrient recovery. Na+ and SAR also received high values, with the exception of the P recovery prediction model. Interestingly, nitrates received high values for both biomass and P recovery, but intermediate values were registered for the N recovery model. Predicted values derived from the PLSR model were plotted against the observed values, showing a close relationship across all Y responses (Figure 4).
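The percentages of variation explained per factor, as reported above, can be computed from the scores and loadings of a PLS model along the following lines. This is a sketch under the assumption that the data matrix has been centered (and scaled) and that the score vectors are mutually orthogonal, as they are in NIPALS-based PLS; the function name percent_explained is hypothetical.

```python
import numpy as np

def percent_explained(data, T, loadings):
    """Percentage of the total sum of squares of `data` captured by each factor.

    data:     centered matrix (X or Y), shape (n, k)
    T:        X scores with mutually orthogonal columns, shape (n, A)
    loadings: matching loadings (P for X, Q for Y), shape (k, A)
    """
    data = np.asarray(data, dtype=float)
    T = np.asarray(T, dtype=float)
    loadings = np.asarray(loadings, dtype=float)
    total = np.sum(data**2)
    per_factor = np.sum(T**2, axis=0) * np.sum(loadings**2, axis=0)
    return 100.0 * per_factor / total
```

Summing the returned values over the retained factors gives the cumulative percentage, in the same way that 45.4% and 34.2% add up to 79.6% for the X variables in this study.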

Discussion
The PLSR model in this study fitted the experimental data well and successfully predicted the established Y responses (i.e., plant biomass and P and N uptake). Important X variables were effluent loading and certain soil parameters such as soil salinity and nitrates. Effluent loading is considered an essential component in the design of land application, associated with organic and nutrient loading, growth of plant biomass, and potential environmental impacts [2]. In the present study it was applied at rates equivalent to the evapotranspiration requirements of the plant species, which reasonably explains its close relationship with the response variables.
Soil salinity, expressed by EC, was also an important contributor to the PLSR model, showing a strong and positive relationship with the response variables. Soil salinity is closely associated with effluent loading, which in turn is associated with plant biomass. For example, species with great biomass potential receive high effluent loading that may result in salt accumulation in the rhizosphere [4,6]. This effect, however, is temporary, because rainfall removes salts from the rhizosphere, reducing the possible negative impacts on soil and vegetation. EC has also been included in a previous PCR model predicting plant biomass from several soil parameters [23]. Na+ also made a significant contribution to the prediction model, particularly for biomass and N uptake, as shown in Figures 2 and 3. SAR had a lower contribution than Na+, as indicated by its lower VIP and relatively smaller standardized regression coefficients. Increases in Na+ and SAR values do not necessarily imply a risk for soil physical properties and/or crop yield, considering the high salt and organic matter contents in the effluent. Indeed, in this field experiment effluent application resulted in an enhancement of soil physical properties, presented in a previous study [4], an effect that is in agreement with the findings of previous studies [24,25].
Interestingly, nitrate was an important predictor in our model, being among the most important X variables, particularly for plant biomass and P uptake. This confirms previous arguments linking nitrates with plant biomass and effluent loading [1,3]. Inorganic N availability in the soil is considered an important driver of plant growth, with a significant effect on biomass yield, tissue nutrient content, and total nutrient/pollutant assimilation [26–29]. The lower effect of nitrates in the model compared to effluent could be attributed to the existence of additional factors with significant influence on the fate of inorganic N in the soil. In this field study inconsistent results were registered between the produced biomass and nitrates in A. donax compared to the other species [3,4], which is in agreement with the findings of a recent study [30]. The authors of that study suggested a potential effect of species on soil microbial communities with a significant role in N cycling. Moreover, N overloading, up to a critical threshold, and the favorable environmental conditions that induced high nitrification rates may have mitigated the contribution of nitrates in the model.
In this PLSR model important, yet easily measurable, parameters were included as predictor variables. These parameters were capable of providing important information for the three specified response variables. In addition, the PLSR model was validated with the data of the previous year, a fact that further increases its potential for future use. The successful development of the prediction model in this study arises from the advantages of the PLSR method. In comparison to other statistical techniques, PLSR can cope with a small or large number of highly collinear and noisy input variables and still provide reasonable prediction of multiple collinear plant response variables. Moreover, PLSR has greater reliability compared to other techniques (single multiple regression or a combination of multiple regression with other multivariate methods) when identifying relevant variables and their magnitudes of influence, independently of the sample size used in the analysis [10]. On the other hand, there are probably limitations with regard to the applicability of the suggested model to other areas or different plant species [31]. Because of its advantages, the PLSR method could also be ideal for the prediction of important management goals involving soil (bio)chemical parameters such as those involved in the C and N cycles [30,32].

Conclusions
PLSR is highly compatible with intensive agronomic practices, such as land application, in which a large number of highly collinear and noisy input variables is monitored to assess plant species performance and detect potential impacts on soil and the surrounding environment. In the present study a robust PLSR model was developed to predict important management goals in land application (i.e., plant biomass and N and P uptake), associated with treatment potential, environmental impacts, and benefits. Effluent loading and several soil parameters commonly monitored in effluent-irrigated lands, yet easily measurable, were included in the final PLSR model as predictor variables. Among these variables, effluent, soil salinity (as EC), and nitrates had the greatest effect on the model. The PLSR model suggested in this study could help during system design and monitoring by providing valuable information on system performance and suggesting adjustments according to the management objectives. Monitoring of a large-scale (spatial and temporal) data set of the pre-defined soil parameters can strengthen the reliability of the PLSR-drawn results, relevant recommendations, and future extrapolations.

Figure 1. Explained variation and factor loadings for X variables (a,c) and Y responses (b,d).